ABSTRACT

Guidance is offered on the creation and use of 
alternative assessment; and a process model is presented that links 
assessment with curriculum and instruction, based on contemporary 
theories of learning and cognition. The introductory chapter,
"Rethinking Assessment," provides background on the purposes of
assessment and the need for new alternatives, plus an overview of key 
assessment development issues. Linking assessment and instruction is 
the focus of Chapter 2, which also reviews current trends in 
assessment. Chapter 3 considers determining the purpose of the 
assessment, and Chapter 4 reviews selecting assessment tasks and 
matching them to student outcomes. Setting the criteria for judging 
student performance is discussed in Chapter 5. Chapter 6 reviews the 
steps necessary to ensure reliable scoring. Chapter 7 makes the 
important point that assessment is not an end in itself, but rather a 
tool for decision making. In this context, reliability and validity 
of assessments are discussed. There are 26 figures illustrating the 
discussion. (SLD)

Foreword



The caveat "Not everything that counts can be counted and not every- 
thing that can be counted counts" was reportedly posted on Albert 
Einstein's office wall. In the context of present educational reform 
discussions, this almost prophetic statement has implications for the 
assessment of student learning. 

Assessment has become the focus of our nation's current educational 
reform agenda. Although our dialogue on authentic assessment has been 
elevated beyond the measurement of purely quantifiable or "countable" 
demonstrations of complex human performances, we have lacked a 
comprehensive, systematic, and integrated framework to assist practi- 
tioners in designing and developing alternative assessments. 

In A Practical Guide to Alternative Assessment, Joan Herman, Pamela 
Aschbacher, and Lynn Winters offer cogent guidance on the creation and 
use of alternative measures of student achievement. They present a 
systematic, integrative, and iterative process model that links assess- 
ment with curriculum and instruction, based on contemporary theories 
of learning and cognition. 

The authors review the purposes of assessment and provide a sub- 
stantive rationale for alternative structures. Yet, as they point out, the 
heart of the book is the illumination of several key assessment issues that
reaffirm our knowledge that assessment tasks must be informed by the 
most important elements of instructional practice. These issues include: 

1. Assessment must be congruent with significant instructional 
goals. 

2. Assessment must involve the examination of the processes as
well as the products of learning.

3. Performance-based activities do not constitute assessment per
se. 

4. Cognitive learning theory and its constructivist approach to 
knowledge acquisition supports the need to integrate assessment 
methodologies with instructional outcomes and curriculum 
content. 

5. An integrated and active view of student learning requires the 
assessment of holistic and complex performance. 

6. Assessment design is dependent on assessment purpose; grading
and monitoring student progress are distinct from diagnosis and 
improvement. 

7. The key to effective assessment is the match between the task 
and the intended student outcome. 

8. The criteria used to evaluate student performance are critical: in 
the absence of criteria, assessment remains an isolated and 
episodic activity. 

9. Quality assessment provides substantive data for making in- 
formed decisions about student learning. 

10. Assessment systems that provide the most comprehensive feed- 
back on student growth include multiple measures taken over 
time. 

The word "assess" comes from the Latin "assidere," which means "to
sit beside." By clarifying the critical conceptual and technical aspects of
using alternative assessments, the authors have reaffirmed the funda- 
mental role of assessment, which is to provide authentic and meaningful 
feedback for improving student learning, instructional practice, and 
educational options. 

As the authors state, assessment is not an end in itself. It is a process 
that facilitates appropriate instructional decision making by providing 
information on two fundamental questions: How are we doing? and How 
can we do it better? 

Perhaps the best way to answer those questions is to sit beside the 
learner and find out. Now that's an interesting alternative! 

Stephanie Pace Marshall
ASCD President, 1992-93

Rethinking Assessment



Assessment is a cornerstone of education reform in the '90s: the Presi- 
dent's education agenda, America 2000; the National Education Goals 
set by the governors; concerns for international competitiveness; re-
newed calls for restructuring and accountability at the state, local, and
school levels. These potent, highly visible initiatives ask educators and 
the nation to focus on high-level goals for our children. They ask that we 
set our sights on excellence and track our progress toward attaining it 
for individual students, for schools, for districts, for states, and for the 
nation. Requiring us to assess progress, they often pose assessment itself 
as a key to attaining such progress, thus ensuring assessment's priority
status in schools. 

Yet this heightened emphasis on assessment comes at a time of 
growing dissatisfaction with traditional, multiple-choice forms of test- 
ing. The result is an explosion of interest in alternative forms of assess- 
ment combined with attempts across the country at all levels — national, 
state, local, and classroom — to create them. Talk of portfolios, exhibits, 
hands-on experiments, and writing-across-the-curriculum abounds. De- 
spite numerous conferences and meetings on these topics, educators 
have had little concrete guidance in the creation and use of alternative 
assessments.

This book is intended to contribute to the process of creating alterna-
tive forms of assessment. It is written for preservice and practicing
teachers, school administrators, and district- and state-level practitio- 
ners who are interested in developing new kinds of assessments. Based 
on current views of meaningful learning and curriculum as well as both
established and evolving principles of measurement quality, this book 
provides a systematic approach to assessment design and raises issues 
critical to ensuring high-quality assessments. In this introductory chap- 
ter, we provide background on the purposes of assessment and the need 
for new alternatives, plus an overview of the key assessment develop- 
ment issues, which constitute the heart of the book. 

It is important to note also what this book is not intended to do. It is 
not meant as a primer on how to plan and implement a comprehensive 
assessment system or on how to mount a total classroom assessment 
program. We emphasize key concerns in developing a single, good 
assessment, one crucial ingredient for sound assessment practices. 



Clarifying Terms 

Many terms are advanced when discussing alternatives to conventional, 
multiple-choice testing. These include alternative assessment, authentic 
assessment, and performance-based assessment. We use these terms 
synonymously to mean variants of performance assessments that require 
students to generate rather than choose a response. Performance assess- 
ment by any name requires students to actively accomplish complex and 
significant tasks, while bringing to bear prior knowledge, recent learn-
ing, and relevant skills to solve realistic or authentic problems. Exhibi-
tions, investigations, demonstrations, written or oral responses, journals,
and portfolios are examples of the assessment alternatives we think of
when we use the term "alternative assessment."



Understanding the Promise of Assessment 

Why all the attention to testing and other assessments? Why do we need 
them? Assessment serves needs at all levels of the education hierarchy: 
for example, assessment helps educators set standards, create instruc- 
tion pathways, motivate performance, provide diagnostic feedback, as- 
sess/evaluate progress, and communicate progress to others. 

Whether we are teachers giving routine exams in our classrooms or
policymakers mandating achievement tests, through testing we set and
communicate standards to those around us: We tell them what's impor-
tant, what deserves focus, and what we expect as good performance. In
the process, significant stakes are often associated with test results —
classroom grades, college admission decisions, job security, self-satisfac- 
tion, and other perks — thus motivating performance. We not only 
communicate to students what's important by including a subject in a 
classroom test, we are also motivating students to learn it. Policymakers 
who mandate tests are suggesting what we should emphasize in the 
schools and are motivating us and our students to perform well on their 
tests. 

Similarly, feedback and progress monitoring functions of assessment
work at several levels. For administrators and school planners, test 
results provide information about program effectiveness and identify 
areas of curricular strength and weakness. In so doing they prove useful
for resource allocation, for identifying staff development or materials 
needs, and for targeting and assessing plans for improvement. For 
teachers, testing provides important diagnostic information for instruc- 
tional groupings, for identifying instructional needs and prescribing 
appropriate instruction, for determining mastery, and for assessing the 
effectiveness of particular instructional units or approaches. For parents 
and students, testing information is a gauge of individual progress,
which helps them understand and build on individual strengths and 
weaknesses. 

For all, testing promises to answer the questions: "How am I [are we]
doing?" "How can I [we] do better?"

Testing fulfills its promise only if it meets some critical conditions. 
Chief among these is the meaning of test performance: tests are useful 
and productive to the extent that they represent significant outcomes for 
students and the important goals of classroom instruction. In other
words, to be valid, fair, and useful, test content must match the knowl- 
edge, skills, and dispositions that teachers are teaching and those that 
students are expected to learn or acquire. 

Figure 1.1 is a simple model illustrating how assessment information 
can be used systematically to support and facilitate instructional im- 
provement. As the figure shows, schools and teachers generally synthe- 
size data from many sources to arrive at school or class goals for students. 
These sources include societal expectations, state and district curricu- 
lum frameworks, legal requirements, and available texts and other 
instructional materials, along with professional standards and profes- 
sional judgments. Once defined, these goals or outcomes serve as guide- 
posts for designing instruction and assessment. Because they reflect the 
same goals that direct instructional activities, assessment results guide 
instructional planning and serve as measures of instructional effective-
ness. Assessment results can be used to identify areas where individuals
may need more help, where additional class instruction is needed, where
instructional units can be improved, where staff development resources
need to be targeted, and so forth. When instruction and assessment are
linked to a common set of significant learning goals, assessments make
sense and can be used to improve instruction.

Figure 1.1

An Integrated Model

It is not that tests ought to drive the curriculum, or that teachers ought
to teach to the test. Rather, good assessment¹ is an integral part of good
instruction. Both testing and instruction ought to reflect significant,
agreed-on goals for students. Assessments should measure important
classroom objectives; assessment results should represent how students
perform on the broad knowledge and skill domains reflected by those
objectives; and classroom instruction should provide students with the
opportunity to learn and attain the knowledge and skills.

Understanding the Limitations 
of Conventional Assessment 

Recent criticisms raise questions about the fit between the model shown
in Figure 1.1 and existing testing practices. Do test scores represent
significant learning outcomes? Do improvements in test score perform-
ance actually represent improvements in learning (Cannell 1987, Linn
et al. 1990, Shepard 1989)? How is it possible that nearly all states report
scoring "above average" compared to a national norm group? The whole
notion of "average" in comparison to a nationally representative norm 
group suggests that some will score below, some at, and some above 
average. Are improvements in test scores the result of improved teaching 
and learning, or do they reflect a meager curriculum with students being 
"drilled and killed" on expected test content? 

The litany goes on. Many people question whether current stand- 
ardized tests adequately represent important goals for student learning 
and development. Criticisms include the narrowness of test content that 
concentrates principally on basic skills in reading, language, and math;
the mismatch between test content and curriculum and instruction; the
overemphasis on routine and discrete skills with a neglect of complex
thinking and problem solving; and the limited relevance of multiple-
choice formats to either classroom or real-world learning (Baker 1989,
Shepard 1989, Herman and Golan 1990). Can educational programs
guided by typical, standardized, multiple-choice testing produce mean-
ingful outcomes? Critics of testing think not.

¹While testing and assessment are used more or less synonymously in this book, we
tend to favor the term assessment because it encourages us to think beyond traditional
definitions of testing.



Considering Alternatives 

Dissatisfaction with existing standardized testing coupled with un- 
abated faith in the value of systematic assessment have given rise to 
proposals for new assessment alternatives. Whether we call these alter- 
natives performance testing, authentic assessment, portfolio assessment, 
process testing, exhibits, or demonstrations, the hope is that they will 
better capture significant and enduring educational outcomes. While 
proposed assessment strategies may be diverse, they share a common 
vision (see Figure 1.2). 



Figure 1.2 

Common Characteristics in Alternative Assessments 



• Ask students to perform, create, produce, or do something. 

• Tap higher-level thinking and problem-solving skills.

• Use tasks that represent meaningful instructional activities. 

• Invoke real-world applications. 

• People, not machines, do the scoring, using human judgment. 

• Require new instructional and assessment roles for teachers. 



Furthermore, these new assessments stress the importance of exam- 
ining the processes as well as the products of learning. They encourage 
us to move beyond the "one right answer" mentality and to challenge 
students to explore the possibilities inherent in open-ended, complex 
problems, and to draw their own inferences. 

Figure 1.3 shows the range of assessment alternatives currently being 
discussed. While some are being heralded as new alternatives, they
actually represent assessment techniques and issues that teachers have 
dealt with for years. Good teachers are always attuned to the process of 
instruction— how a lesson is going, who's having difficulty, who's paying 
attention, how a certain group is working— and adjust their instructional
plans and activities accordingly. Similarly, most teachers use a range of
information sources to determine how well their students have learned.
What is new about these assessments is that they make explicit and
formal what was previously implicit and informal. They also encourage
teachers to articulate their instructional goals clearly, to ensure align-
ment between their goals and current views of meaningful teaching and
learning, and to gather systematic evidence to guide their instructional 
efforts. 





Figure 1.3

Assessment Alternatives

Assessing Processes

• Clinical interviews

• Documented observations

• Student learning logs and journals

• Student self-evaluation (oral or written)

• Debriefing interviews about student projects, products, and
demonstrations (student explains what, why, and how, and reflects
on possible changes)

• Behavioral checklists

• Student think-alouds in conjunction with standardized
or multiple-choice tests

Assessing Products

• Essays with prompts and scoring criteria

• Projects with rating criteria

• Student portfolios with rating criteria

• Student demonstrations/investigations (expository or
using the arts)

• Paintings, drama, dances, and stories with rating criteria

• Attitude inventories, surveys

• Standardized or multiple-choice tests, perhaps with
section for "explanations"



Supporting Instructional Improvement

Direct assessment of student writing illustrates the potential power of 
these new types of assessments: the integration of instruction and 
assessment. In one district, teachers collaborated to define the attributes
of good writing and developed a scoring scheme to capture the attributes. 
Other teachers were then trained in the reliable use of the scoring scheme 
and were used as raters in a districtwide writing assessment. Teachers 
found that the elements of the scoring rubric provided a good anchor for 
their instruction and gave them a fast and uniform way to assess and 
provide feedback for their students' classroom writing. Furthermore, the 
district's emphasis on writing and other state initiatives encouraged the 
teachers to change some aspects of their writing instruction. The result 
was improved student writing and teacher confidence in their instruc- 
tion and assessment. The development of performance tests in other 
content areas, with similar support for instructional change, shows 
similar promise. 



Knowing How to Proceed 
with Assessment Development 

Although alternative assessment implies new strategies for looking at 
educational outcomes, the process for developing these assessments is 
based on decades of measurement research. Developers of high-quality 
tests, be they norm-referenced, criterion-referenced, or performance- 
based tests, adhere to the following process with certain variations: 

1. Specify the nature of the skills and accomplishments students are 
to develop. 

2. Specify illustrative tasks that would require students to demon- 
strate these skills and accomplishments. 

3. Specify the criteria and standards for judging student performance
on the task. 

4. Develop a reliable rating process. 

5. Gather evidence of validity to show what kinds of inferences can 
be made from the assessment.

6. Use test results to refine assessment and improve curriculum and 
instruction; provide feedback to students, parents, and the com- 
munity. 
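
As a rough illustration of step 4, the sketch below checks how often two raters give the same rubric score to the same piece of student work. The function name, the 1-4 rubric scale, and the scores are hypothetical and are not taken from this book. Exact agreement and agreement within one point are common summary checks, not necessarily the procedure the later chapters describe.

```python
def agreement_rates(rater_a, rater_b):
    """Return (exact, adjacent) agreement proportions for two raters.

    exact    = share of papers given identical scores
    adjacent = share of papers scored within one rubric point
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty score lists of equal length")
    pairs = list(zip(rater_a, rater_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, adjacent


# Hypothetical scores on a 1-4 rubric for eight student essays.
scores_a = [4, 3, 2, 4, 1, 3, 2, 3]
scores_b = [4, 2, 2, 3, 1, 3, 4, 3]

exact, adjacent = agreement_rates(scores_a, scores_b)
print("exact agreement:", exact)
print("agreement within one point:", adjacent)
```

Low agreement would suggest that the scoring criteria or the rater training need another look before results are used for decisions.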

Subsequent chapters describe how this test development process
applies to alternative assessment. The process itself is modified accord- 
ing to an assessment's purpose, no matter the format of assessment. For 
example, with large-scale assessment or minimum competency testing 
where stakes are high and one-shot assessment is typical, all steps are 
essential. For routine classroom assessment, when teachers have contin- 
ual opportunities to formally or informally assess student progress, steps 
four and five are less crucial. In the classroom, the results of any single 
assessment are moderated by other forms of formal and informal evi-
dence; this compensates for what may be lost by not gathering formal
validity and reliability data. Nonetheless, teachers need to be well-ac- 
quainted with the characteristics of a technically sound assessment 
process so they can be wise consumers of the large-scale assessments 
and commercial products that influence their classroom practices. 



Balancing Assessment Strategies 

There is no one right way to assess students. Although we present a
strong case for performance assessment, we neither say that all assess- 
ments need to be of this type nor reject the use of multiple-choice and 
other forms of selected-response tests. We do affirm that performance
assessments offer appealing ways to assess complex thinking and prob- 
lem-solving skills and, because they are grounded in realistic problems, 
are potentially more motivating and reinforcing for students. However, 
while performance assessments may tell us how well and deeply stu- 
dents can apply their knowledge, multiple-choice tests may be more 
efficient for determining how well students have acquired the basic facts 
and concepts. A balanced curriculum requires a balanced approach to 
assessment. 

Furthermore, just because an assessment asks students to perform an 
interesting or complex activity does not make it a good assessment. Good 
assessment reliably measures something beyond the specific tasks that 
students are asked to complete. The results of good assessment identify 
what students can do in a broad knowledge or skill domain. The skills 
that students exhibit in the assessment situation should transfer to other 
situations and other problems.

Holding Assessments to High Standards

Regardless of the purpose or format, quality assessments should meet 
certain common standards. The Center for Research on Evaluation, 
Standards, and Student Testing (CRESST) (Linn, Baker, and Dunbar
1991) has developed criteria that represent a touchstone throughout the 
assessment development process. The criteria include: 

■ Consequences. Testing history is full of examples of good inten- 
tions gone awry. This criterion requires that we plan from the 
outset to assess the actual consequences of the assessment. Does 
it have positive consequences or are there unintended effects 
such as narrowing of curriculum, adverse effects on disadvan- 
taged students, and so on? 

■ Fairness. Does the assessment consider fairly the cultural back- 
ground of those students taking the test? Have all students had 
equal opportunity to learn the complex thinking and problem- 
solving skills that are being targeted? 

■ Transfer and Generalizability. Will the assessment results sup- 
port accurate generalizations about student capability? Are the 
results reliable across raters, and consistent in meaning across 
locales? 

■ Cognitive Complexity. We cannot tell from simply looking at an 
assessment whether or not it actually assesses complex thinking 
skills. Does an assessment in fact require students to use complex 
thinking and problem solving? 

■ Content Quality. The tasks selected to measure a given content 
domain should themselves be worthy of students' and raters' time
and efforts. Is the selected content consistent with the best current 
understanding of the field and does it reflect important aspects 
of a discipline that will stand the test of time? 

■ Content Coverage. The content coverage criterion requires that 
assessment be aligned with the curriculum and, over a set of 
assessments, represent the full curriculum. Because time con- 
straints will probably limit the number of alternative assessments 
that can be given, adequate content coverage represents a signifi- 
cant challenge. Are the key elements of the curriculum covered 
by the set of assessments? 

■ Meaningfulness. One of the rationales for more contextualized 
assessments is that they ensure that students engage in meaning-
ful problems that result in worthwhile educational experiences
and higher levels of motivation. Do students find the assessment
tasks realistic and worthwhile?

■ Cost and Efficiency. To be effective tools, assessments must be
cost effective. Labor-intensive performance-based assessments 
require efficient data collection and scoring procedures. Is the 
information about students worth the cost and time to obtain it? 

Finally, it is important to note that alternative assessment is a developing 
field. New strategies are evolving as are new methodologies for ensuring 
their quality. As we learn more about alternative assessment, current 
approaches may be refined or even reformulated.

Linking Assessment
and Instruction 



New visions of effective curriculum, instruction, and learning demand 
new attention to systematic assessment. No longer is learning thought to 
be a one-way transmission from teacher to students with the teacher as 
lecturer and students as passive receptacles. Rather, meaningful instruc- 
tion engages students actively in the learning process. Good teachers 
draw on and synthesize discipline-based knowledge, knowledge of 
student learning, and knowledge of child development. They use a 
variety of instructional strategies, from direct instruction to coaching, to 
involve their students in meaningful activities (discussion, group proc-
ess, hands-on projects) and to achieve specific learning goals. Good
teachers constantly assess how their students are doing, gather evidence 
of problems and progress, and adjust their instructional plans accord- 
ingly. 

In this chapter we review the educational and societal trends that 
support these new visions of teaching and learning, which have led to a 
need for new forms of assessment (see Figure 2.1). These same trends 
place unprecedented demands on teachers' professional skills, requiring
them to integrate knowledge of intended goals, learning processes, 
curriculum content, and assessment.

Figure 2.1

1. Changes from behavioral to cognitive views of learning and assessment

• From sole emphasis on the products or outcomes of student learning to
a concern for the learning process

• From passive response to active construction of meaning

• From assessment of discrete, isolated skills to integrated and cross-
disciplinary assessment

• Attention to metacognition (self-monitoring and learning to learn skills)
and conative skills (motivation and other areas of affect that influence
learning and achievement)

• Changes in the meaning of knowing and being skilled, from an accu-
mulation of isolated facts and skills to an emphasis on the application
and use of knowledge.

2. From paper-pencil to authentic assessment

• Relevance and meaningfulness to students

• Contextualized problems

• Emphasis on complex skills

• Not single correct answer

• Public standards, known in advance

• Individual pacing and growth.

3. Portfolios: from single occasion assessment to samples over time

• Basis for assessment by teacher

• Basis for self assessment by students

• Basis for assessment by parents.

4. From single attribute to multi-dimensional assessments

• Recognition of students' many abilities and talents

• Growing recognition of the malleability of student ability

• Opportunities for students to develop and exhibit diverse abilities.

5. From near exclusive emphasis on individual assessment to group
assessment

• Group process skills

• Collaborative products.



Facing New Demands on Education

Consider what futurists' predictions imply for educational goals and for 
the kinds of skills students and society as a whole will need for the 21st 
century (Benjamin 1989). Knowledge is exploding geometrically; the 
world's knowledge base has quadrupled in this century (Cornish 1986). 
Given this pace, no one individual can be expected to keep up with the
information flow in a single discipline, much less across disciplines. 
Such a knowledge explosion makes futile most attempts to have students 
memorize and regurgitate large bodies of facts. 

Economic trends also push us away from a fact-based curriculum. 
The shift from a manufacturing- to an information- and service-based 
economy requires that individuals have skills in accessing and using 
information and in working with people. These changes in the workforce 
and in the pace and complexity of modern life suggest that people will 
need to be flexible, to shift jobs frequently, and to adapt to change. To 
prepare students for success in the future, schools must emphasize how 
to apply rather than just acquire information. 



Using Cognitive Learning Theories 

New cognitive theories of learning propel us in similar directions. Early 
learning theories assumed that complex skills were acquired bit-by-bit
in a carefully arranged sequence of small prerequisite and component 
skills, often articulated in discrete behavioral objectives. It was assumed 
that rote basic skills should be taught and mastered before going on to 
"higher-order," complex thinking skills. Evidence from contemporary 
cognitive psychology, however, indicates that learning is not linear and 
is not acquired by assembling bits of simpler learning. Learning is an 
ongoing process during which students are continually receiving infor-
mation, interpreting it, connecting it to what they already know and have
experienced (their prior knowledge), and reorganizing and revising their
internal conceptions of the world, which are called "mental models,"
"knowledge structures," or "schema." 



Learning's Active Nature 



From today's cognitive perspective, meaningful learning is reflective,
constructive, and self-regulated (Wittrock 1991, Bransford and Vye 1989,
Marzano et al. 1988, Davis et al. 1990). People do not merely record
factual information but create their own unique understandings of the
world, their own knowledge structures. To know something is not just
to passively receive information, but to interpret it and incorporate it
into one's prior knowledge. In addition, we now recognize the impor-
tance of knowing not just how to perform, but also when to perform and
how to adapt that performance to new situations. The presence or
absence of discrete bits of information, which is typically the focus of
many traditional multiple-choice tests, is not of primary importance in 
the assessment of meaningful learning. Instead, we care more about how 
and whether students organize, structure, and use that information in 
context to solve complex problems. 



Learning Is Not Linear 

Learning does not best proceed in discrete hierarchies. Because learning
is not linear and can take many directions at once at an uneven pace, 
conceptual learning is not something to be delayed until a particular age 
or until all the "basic facts" have been mastered. People of all ages and 
ability levels constantly use and refine concepts. 

Current evidence makes it clear that instruction emphasizing struc-
tured drill and practice on isolated facts and skills does students a major
disservice. Insisting that students demonstrate a certain level of arith- 
metic mastery before being allowed to enroll in algebra or that they learn 
how to write a good paragraph before tackling an essay are examples of 
this discrete skills approach. Such learning out of context makes it more 
difficult to organize and remember the information being presented. 
Applying taught skills later when solving real-world problems also 
becomes more difficult. Students who have trouble mastering decontex- 
tualized "basics" are often put in remedial classes or groups and are not 
given the opportunity to tackle complex and meaningful tasks. 



Learners Are Multitalented 

Current intelligence theories that stress the existence of a variety of 
human talents and capabilities depart from the popular view that intel- 
ligence or ability is a single, fixed capability (Sternberg 1991, Gardner 
1982). Gardner argues that while traditional schooling has emphasized 
only two abilities, verbal-linguistic and logical-mathematical, many 
other important "intelligences" exist, including visual-spatial, kinesthetic,
musical, intrapersonal, and interpersonal. Gardner claims that all
individuals have strengths in two or three of those areas. Furthermore, a
tremendous variety exists in the modes and speeds with which people 
acquire knowledge, in the attention and memory capabilities they can
apply to knowledge acquisition and performance, and in the ways in 
which they can demonstrate the personal meaning they have created. To 
be successful with all students, instruction and assessment need to draw 
on more than linguistic or logical-mathematical intelligences and subscribe
to the assumption that all students can learn.



Learning Includes Cognition, Metacognition, and Affect

Recent studies of the integration of learning and motivation highlight
the importance of affective and metacognitive (thinking about thinking)
skills in learning (McCombs 1991, Weinstein and Meyer 1991). For
example, Belmont and others (1982) suggest that poor thinkers and
problem solvers differ from good ones not so much in the skills they 
possess as in their failure to use the skills. Mere acquisition of knowledge 
and skills does not make people into competent thinkers or problem 
solvers. They must also acquire the disposition to use the skills and
strategies and know when to apply them. 

Research and experience, such as that in the writing field (Gere and 
Stevens 1985, Burnham 1986), demonstrate the value of engaging learners
in thoughtful consideration of what constitutes excellent work and
how to judge their own efforts. Providing students with models of 
exemplary performance and encouraging them to reflect on their work 
helps students to understand and internalize high standards. 

Meaningful learning is seen as intrinsically motivating. The long- 
term value of traditional, extrinsic motivators such as grades and stars 
is questionable. Research suggests that these techniques may even detract
from a learner's intrinsic motivation, resulting in lowered mastery
or performance (Lepper and Greene 1978). 



Learning's Social Context 

The role of the social context in shaping complex cognitive abilities and 
dispositions has also received attention over the past several years. 
Although real-life problems often require people to work together as a 
group, most traditional instruction and assessment have involved independent
work. We now know that groups facilitate learning in several
ways. Working together with peers on a common task provides: (1) many
models of effective thinking strategies; (2) mutual constructive feedback;
(3) an appreciation for the value of collaborating with others; and (4) help
in attaining difficult or complex skills or knowledge.

The demands of a democracy provide other rationales for the value 
of group inquiry. Students who work together in a "community of 
learners" are expected to listen to each other with respect, to reflect and 
build on one another's ideas, to demand evidence to support opinions, 
to assist each other in drawing implications, and to challenge the facts, 
assumptions, and arguments of different points of view (Jones and
Fennimore 1990).



Focusing on a Thinking Curriculum 

A modern approach to curriculum, coined the "Thinking Curriculum"
by Lauren Resnick and Leopold Klopfer (1989), strongly advocates an
integrated, active view of student learning. The thinking curriculum
stresses the importance of process as well as product. Students are often
involved in tasks similar to those encountered in the real world; students
carry out tasks requiring complex thinking, planning, and evaluating.
They solve problems, make decisions, construct arguments, and so forth.
In this way, they model the process of a professional discipline while
acquiring knowledge in that discipline.

According to Fennimore and Tinzmann (1990), the following four key
principles characterize a thinking curriculum. 



Promotion of In-depth Learning 

A thinking curriculum helps students acquire the key concepts and tools 
for making, using, and communicating knowledge in a specific field. 
Working knowledge of the field implies an integrated network of knowledge
and concepts rather than a collection of isolated facts.

In a thinking curriculum, students develop an in-depth understanding
of the essential concepts and processes for dealing with those
concepts, similar to the approach taken by experts in tackling their tasks.
For example, students use original sources to construct historical accounts;
they design experiments to answer their questions about natural
phenomena; they use mathematics to model real-world events and
systems; and they write for real audiences.



Content and Process Objectives in Real-world Tasks

Rather than focusing on simple and discrete skills, students engage in
the complex, holistic thinking needed in most challenges outside the
classroom. According to Resnick (1989), such real-life thinking often
involves: meaningful processes of decision making and problem solving;
collaborating with others; the use of available tools; connection to
real-world events and objects; and use of interdisciplinary knowledge.

Holistic Performances in Increasingly Challenging 
Environments 

A thinking curriculum does not isolate skills and facts. Rather it includes
the holistic performance of meaningful, complex tasks in increasingly
challenging environments. Materials and content are structured so that
students gradually regulate their own learning. This approach ensures
that learning motivates students and encourages in them a sense of
efficacy and confidence.



Connection of Content and Process to Learners' Backgrounds 

A thinking curriculum takes into account the experiences and knowl- 
edge that students bring to school and then expands on and refines this 
prior knowledge by connecting it to new learning, making curriculum 
content relevant to important issues and tasks in the students' lives. 
When students relate school learning to real-life issues, they are more
likely to seek and value the perspectives of others: peers, teachers,
parents, community members, and experts. In so doing, they develop 
interpersonal competencies for creating and participating in dialogue 
with individuals who have different perspectives and come from diverse 
backgrounds. 



Linking Assessment and Instruction 

Figure 2.2 summarizes many of the basic learning principles discussed
in this chapter and describes some of the implications these principles 
have for both instruction and assessment. As Figure 2.2 indicates, 
assessment not only evaluates how much was learned in any particular
unit of instruction, but also provides "real time" information to students
and teachers about their progress and ways to improve.



Figure 2.2 

Linking Instruction and Assessment: 
Implications from Cognitive Learning Theory



Theory: Knowledge is constructed. Learning is a process of creating 
personal meaning from new information and prior knowledge. 
Implications for Instruction/Assessment:

• Encourage discussion of new ideas. 

• Encourage divergent thinking, multiple links and solutions, not 
just one right answer. 

• Encourage multiple modes of expression, for example, role play, 
simulations, debates, and explanations to others. 

• Emphasize critical thinking skills: analyze, compare, generalize,
predict, hypothesize. 

• Relate new information to personal experience, prior knowl- 
edge. 

• Apply information to a new situation.

Theory: All ages/abilities can think and solve problems. Learning isn't
necessarily a linear progression of discrete skills. 
Implications for Instruction/Assessment:

• Engage all students in problem solving. 

• Don't make problem solving, critical thinking, or discussion of 
concepts contingent on mastery of routine basic skills. 

Theory: There is great variety in learning styles, attention spans, 
memory, developmental paces, and intelligences. 
Implications for Instruction/Assessment: 

• Provide choices in tasks (not all reading and writing). 

• Provide choices in how to show mastery/competence. 

• Provide time to think about and do assignments. 

• Don't overuse timed tests. 

• Provide opportunity to revise, rethink. 

• Include concrete experiences (manipulatives, links to prior per- 
sonal experience). 


Theory: People perform better when they know the goal, see models, 
know how their performance compares to the standard. 
Implications for Instruction/Assessment: 

• Discuss goals; let students help define them (personal and class).

• Provide a range of examples of student work; discuss charac- 
teristics. 

• Provide students with opportunities for self-evaluation and peer 
review. 

• Discuss criteria for judging performance. 

• Allow students to have input into standards. 

Theory: It's important to know when to use knowledge, how to adapt 
it, how to manage one's own learning. 
Implications for Instruction/Assessment: 

• Give real-world opportunities (or simulations) to apply/adapt 
new knowledge. 

• Have students self-evaluate: think about how they learn 
well/poorly; set new goals, why they like certain work. 

Theory: Motivation, effort, and self-esteem affect learning and per- 
formance. 

Implications for Instruction/Assessment: 

• Motivate students with real-life tasks and connections to per- 
sonal experiences. 

• Encourage students to see connection between effort and results. 

Theory: Learning has social components. Group work is valuable.
Implications for Instruction/Assessment: 

• Provide group work. 

• Incorporate heterogeneous groups. 

• Enable students to take on a variety of roles. 

• Consider group products and group processes. 



Assessments' various forms promote a multiplicity of goals that 
include, but are not limited to, the acquisition of content knowledge. 
Tests are no longer limited to scheduled, timed, pencil-paper tasks for 
individuals to perform alone to show what they know. Assessment now 
takes place in many contexts and includes individual and group work, 
aided and unaided responses, and short or long time periods. Open
discussion of performance criteria and standards of excellence among
teachers, students, and even parents serves as a hallmark of alternative 
assessment. Because assessment is an integral part of instruction, con- 
sideration of instructional goals is the crucial first step in designing 
meaningful assessment tasks and scoring procedures.



Determining Purpose



The first step in assessment design or selection is to know the purpose 
of your assessment: What do you plan to use the results for? What aspects 
of student performance do you want to know about? 

While this book is not intended as a primer on the purposes and uses 
of assessment, you will need to consider your purpose throughout the 
assessment process. Is your primary purpose to assess student accom- 
plishment — for instance, how well have students learned to write sto- 
ries, to communicate orally, to synthesize research? If so, you will be 
most interested in assessing the status or level of student accomplish- 
ment for purposes of grading, special placement, and progress monitor- 
ing, or for school, district, and other extra-school purposes of evaluation 
and accountability. Because the primary intent is to describe the extent 
to which students have attained particular knowledge and skills, your 
assessment should focus on the outcomes or product of student learning. 

However, if your purpose is diagnosis and improvement, such as 
diagnosing a student's strengths and weaknesses, prescribing the most 
appropriate instructional programs, or identifying strategies students 
use well and those they need help with, you'll want an assessment that 
gives you information about the process as well as the outcome. What 
have the students achieved and how did they do it? Process information 
provides such explanations.

The purpose and use of your assessment influence how much attention
you give to collecting evidence of reliability and validity, a topic we
treat more fully in Chapters 6 and 7. The higher the assessment stakes
are, the greater the obligation to document reliability and validity.
Adequate levels of both are essential when results are to be used to 
determine, for instance, students' promotions or placement into special 
classes, or to reward teachers or schools. 



Setting Primary Instructional Goals 

Good assessment demands that you know and are able to articulate your 
major instructional goals. These determine what aspects of performance 
you will want to know about. What do you want your students to be able 
to accomplish in a unit, in a course, in a discipline, or across disciplines? 
What should your instructional program add up to? What should stu- 
dents be able to do at the completion of a unit, a course, or a year of study 
that they were not able to do before? What critical areas of student 
development do you want to influence? 

The answers to these questions define your classroom priorities and 
represent the primary targets of your instructional activities. These same 
priorities should also ground the assessment tasks you require of students.
Such a fit contributes to a fair assessment (students have the
opportunity to acquire the knowledge and skills you are assessing) and
contributes to a meaningful assessment task that reinforces the skills and 
accomplishments you deem most important. 



Determining Priority Outcomes 

While designating goals may seem simple, it is challenging to set priori- 
ties from among the myriad possibilities. What major fields of knowl- 
edge, skills, and dispositions are worth teaching and worth assessing? 
What outcomes are you trying to achieve? Because performance assess- 
ments require considerable time and energy — both yours and your 
students — you will want to focus on a relatively small number of impor- 
tant outcomes, each perhaps representing a month or a quarter's worth 
of instruction. These assessments should aim at your major learning 
objectives for students. To help define these objectives, ask yourself this
series of interrelated questions (to which we have supplied some sample 
responses): 

1. What Important Cognitive Skills Do I Want My Students To Develop?

I want students to be able to: 

• Communicate effectively in writing, or more specifically, to write 
persuasively, to write good descriptions, and to write stories.

• Communicate effectively orally. 

• Analyze literature using plot, character, setting, and theme. 

• Analyze issues using primary source and reference materials. 

• Use algebra to solve everyday problems. 

• Analyze current events from historical, political, geographic, and 
multicultural perspectives. 

• Design and conduct studies to aid decision making about current 
or everyday problems. 

• Use the scientific method. 

• Use different media to express what they know. 

2. What Social and Affective Skills Do I Want My Students To Develop? 

I want them to be able to: 

• Work independently. 

• Develop a spirit of teamwork and skill in group work. 

• Appreciate their individual strengths. 

• Be persistent in the face of challenges. 

• Have pride in their work. 

• Enjoy and value learning. 

• Have confidence in their abilities. 

• Have a healthy skepticism about current arguments and claims. 

• Understand that we all have strengths and that each person is able 
to excel in some way. 

3. What Metacognitive Skills Do I Want My Students To Develop? 

I want them to be able to: 

• Reflect on the writing process they use, evaluate its effectiveness, 
and derive their own plans for how it can be improved. 

• Discuss and evaluate their problem-solving strategies. 

• Formulate efficient plans for completing their independent pro- 
jects and for monitoring their progress. 

• Evaluate the effectiveness of their research strategies. 

4. What Types of Problems Do I Want Them To Be Able To Solve?

I want them to: 

• Know how to do research. 

• Solve problems that require geometric proofs. 

• Understand the types of problems that trigonometry will help 
them solve. 

• Apply the scientific method.

• Predict consequences.

• Solve problems that have no right answer. 

• Make healthy choices. 

• Create their own unique expressions. 

5. What Concepts and Principles Do I Want My Students To Be Able To
Apply? 

I want them to be able to: 

• Understand what a democracy is.

• Understand cause-and-effect relationships in history and in 
everyday life. 

• Understand the meaning of various logical propositions. 

• Criticize literary works based on plot, setting, motive, and so on. 

• Understand and recognize the consequences of substance abuse. 

• Apply basic principles of ecology and conservation in their 
everyday lives. 

Be as specific as possible in formulating your answers to these questions. 
While you shouldn't produce the excruciating detail found in behavioral 
objectives of the past, you should describe your primary outcomes with 
enough detail that others can agree on what the outcomes mean and 
whether or not students have attained them. 



Using Available Resources 

Beyond your own judgments in answering the above questions, you may 
find it helpful to consult curriculum frameworks, respected content 
experts, or innovative projects that reflect your educational philosophy. 
The following are resources you might wish to consider. 



National Curriculum Groups 

The Curriculum and Evaluation Standards for School Mathematics,
issued by the National Council of Teachers of Mathematics (1989), is a
useful resource. The standards emphasize developing students' capabili- 
ties to use mathematics in solving problems, in reasoning, and in 
communicating. Further, they encourage students to value mathematics 
and feel self-confident about their ability to do mathematics. For exam- 
ple, the NCTM standards in communication suggest that students be able 
to: 

■ Articulate their reasons for using a particular mathematics representation
or solution;

■ Summarize the meaning of data they have collected; 

■ Describe how mathematical concepts are related to physical or 
pictorial models; and 

■ Justify arguments using deductive or inductive reasoning.

These major goals for student performance may stimulate your thinking 
about goals you want to set for your students in mathematics. 

Actually, many subject discipline groups are developing or have 
developed goal statements. The American Association for the Advance- 
ment of Science (AAAS 1989) makes recommendations for restructuring 
curriculum in the sciences in Science for All Americans: Project 2061.
The report recommends four goals for science education: understanding 
the scientific endeavor, developing scientific views of the world, forming 
historical and social perspectives on science, and developing scientific 
habits of mind. 

The National Council of Teachers of Social Studies, the National
Council of Teachers of Science, and the National Council of Teachers of 
English are all effective information sources for their disciplines. The 
Center for Civic Education has published Civitas, which covers civic
education (Quigley and Bahmueller 1991). 



State Curriculum Frameworks 

State curriculum frameworks offer another valuable resource. California 
has led in developing a history-social science framework, including 
history, geography, economics, political science, anthropology, psychology,
sociology, and the humanities (California State Department of
Education 1988). The framework includes three major goal areas. Each 
area contains curriculum strands that spiral up through the course of a 
student's education: 

■ Goals of knowledge and cultural understanding 

— historical literacy 

— ethical literacy 

— cultural literacy 

— geographic literacy 

— economic literacy 

— sociopolitical literacy 

■ Goals of skills attainment and social participation

— basic study skills 

— critical thinking skills 

— participation skills 

■ Goals of democratic understanding and civic values 

— national identity 

— Constitutional heritage 

— civic values, rights, and responsibilities 

Each of these areas comprises a number of learning goals that could be 
the subjects of assessment. For example, the participation skills under 
"goals of skills attainment and social participation" include personal 
skills, group interaction skills, and social and political participation
skills. "Economic literacy" includes specifics related to the basic eco- 
nomic problems facing all societies; comparative economic systems; 
basic economic goals, performance and societal problems; and the 
international economic system. 

Connecticut has formulated the Common Core of Learning (1987), a 
set of core learning standards for high school students. The standards 
include generic skills that cross disciplines, and the big ideas and the 
skills, concepts, processes, and techniques that characterize a specific 
discipline. The generic skills provide a starting point for thinking
about key student outcomes in any discipline. They are:

■ Communicating clearly; 

■ Questioning; 

■ Formulating problems; 

■ Thinking and reasoning; 

■ Solving complex, multi-step problems; 

■ Synthesizing knowledge from a variety of sources; and 

■ Using cooperation and collaboration. 

Connecticut's science skills, processes, and techniques, which are also 
general, include: 

■ Developing a hypothesis; 

■ Designing experiments; 

■ Drawing inferences from data; 

■ Using observation and analyzing similarities and differences in 
phenomena; and 

■ Working with laboratory equipment. 







Other Resources 

Frameworks developed for national and international assessments offer 
another source of information for assessment development. The National 
Assessment of Educational Progress (NAEP) regularly assesses student 
performance in mathematics, language arts, science, history, geography, 
and adult literacy. As part of its process, NAEP conducts a national
consensus process that defines the content framework for each assessment
and sets priorities for student accomplishment. (For more information,
contact the Educational Testing Service, Rosedale Road, Princeton,
NJ 08541; telephone (609) 921-9000.)



Tapping School Restructuring Efforts 

Groups involved in school restructuring efforts offer an additional re- 
source. For example, central to the efforts of the Coalition of Essential 
Schools is a final exhibition in which students demonstrate their accom- 
plishments. Coalition members have thought carefully about what the 
nature of those accomplishments should be. Various schools have de- 
fined visions of what their graduates should be like at the end of a course 
or school. Here are just a few examples: 

■ Students in this course will have a greater understanding of many 
of the issues that their generation faces. They will speak and write 
about current events knowledgeably, inquisitively, and honestly. 
And they will reflect carefully about their roles as presenters of
information (Parkway South, contemporary issues). 

■ Students . . . most importantly, will know how to apply geometric
concepts to real-world situations (Sullivan High School,
mathematics).

■ Students in this course will know how to work together to
produce informative work of high quality. They will have a solid
grasp of the field techniques required to study ecology. They will 
feel proud knowing they have made a tangible contribution to 
their community . . . . And perhaps most important they will have
a strong understanding of and commitment to the natural envi- 
ronment in which they live (Sullivan High School, ecology). 

■ Graduates of this school will know how to explore ideas in a deep 
and meaningful way, and they will be able to express their
thoughts eloquently, cohesively, and correctly (Sullivan High 
School, humanities).

■ Graduates of Metro High School will have a solid understanding
of their interests and talents. They will leave the school confident
that they have the skills necessary for their future goals, which
they have carefully researched and planned for.

■ Graduates of this school will be motivated, insightful, and dis- 
criminating adults who think independently and responsibly.
They will have considerable knowledge of subject matter, well- 
developed learning skills . . . (Crefeld School).

Vision statements and descriptions of the exhibitions process are found 
in The Exhibitions Collections, developed and distributed by the Coalition
at Brown University. (Contact Joe McDonald, Coalition of Essential
Schools, Brown University, Box 1969, Providence, RI 02912; telephone
(401) 863-3384; FAX (401) 863-2045.)

Other sources of significant and innovative student outcomes include 
Henry Levin's Accelerated Learning Project (1989); James Comer's Project
(Comer and Haynes-Norris 1991); Eliot Wigginton's Foxfire Project
(Puckett 1989); and the Galef Institute's Different Ways of Knowing
curriculum (Galef Institute 1992). 



Considering Interdisciplinary Goals 

Many of the new frameworks being developed show increasing appreciation
of interdisciplinary outcomes. The NCTM mathematics standards
show attention to communication skills. The AAAS sees math,
science, and technology as integrally related and recommends that 
students understand how powerful ideas of science emerged from par- 
ticular historical, cultural, and intellectual contexts. The California 
history-social science framework exemplifies an interdisciplinary cur- 
riculum approach as do many of the exhibitions in Coalition Schools. 
As you proceed with assessment development, you too may want to 
consider emphasizing interdisciplinary goals for your students. 



Consulting with Colleagues 

What are the specific targets of your classroom curriculum and your 
instructional program? In consulting available resources to answer this 
question, don't neglect your colleagues. Collegial collaboration engenders
schoolwide consensus building and better assessments. If you're
working on a department, school, or districtwide assessment, you may
want to include parents, community members, and representatives from
the business sector in the deliberation process.

Setting Meaningful Priorities: 
A Difficult Proposition 

Alone or as part of a group, you'll probably find you've generated a long 
list of possible targets for performance assessment. As you review your
list, either alone or in collaboration with others, you could use the
following questions to aid in focusing your assessment: 

1. How much time will it take students to develop or acquire the skill
or accomplishment? If the answer is an hour, a day, or even a week,
it's probably not worth the time and effort of a full performance
assessment.

2. How does the desired skill or accomplishment relate to other
complex cognitive, social, and affective skills? Higher priority
should be given to skills that are integrally related to other important
skills. Give priority to those that will apply to a lot of
situations.

3. How does the desired skill or accomplishment relate to long-term
school and curricular goals? Give priority to the long-term goals,
or integral components of important long-term goals.

4. How does the desired skill or accomplishment relate to your
school improvement plan? Give priority to those that are valued 
in the plan. 

5. What is the intrinsic importance of the desired skills or accomplishment?
Clearly give priority to those that are important and
discard any that represent superficial or trivial goals. (While this
seems obvious, think of how many test items you've answered
about trivial details.)

6. Are the desired skills and accomplishment teachable and attainable
for your students? While seeking to challenge students and
to draw the highest accomplishment from all students, pay attention
to whether or not your students have the necessary prerequisite
skills, concepts, and principle knowledge to attain your
goals and whether you have the materials and capability to help
them reach these goals.

As a result of this type of decision-making process, you will identify 
what you believe to be a critical set of skills and accomplishments. Each
should be described with sufficient specificity so that others understand
their meanings. While you may need to revisit and revise these initial
descriptions, this priority list will outline the initial targets for the 
assessment design. 

To learn how to develop and conduct alternative assessments, you 
may want to start with a single assessment. Consider the student outcomes
you value most, the time of year, and where you are in the
curriculum, and then designate one of your priority outcomes as a first
target. Your next job will be to identify appropriate tasks for assessing 
that target.



Selecting Assessment Tasks



The key to good assessment is matching the assessment task to your
intended student outcomes (the knowledge, skills, and dispositions you
identified in your initial assessment planning). What tasks or assignments
represent these intended accomplishments? You can create many
interesting and suitable possibilities. When considering assessment 
tasks, your best choices are those you believe most closely target your 
instructional aims and allow your students to demonstrate their progress
and capabilities. 

As you create interesting tasks for students, you may find that some 
don't fit your originally designated priorities, but do represent important 
goals you may have overlooked. This is an example of how the assessment
development process is nonlinear. Decisions at each step are
influenced by those that precede and follow it. Many teachers find it
easier to articulate valued student outcomes after thinking about the
kinds of student assignments they find most interesting, challenging, and
worthwhile. 

A number of issues need to be considered in designing appropriate 
assessment tasks. Figure 4.1 provides a conceptual overview of such
issues. Figure 4.1 also clearly portrays the difficulty in thinking about
assessment tasks without simultaneously thinking about the criteria
you'll use to judge performance on those tasks. While we deal with
performance criteria in Chapter 5, the separation itself illustrates that
developing assessment is neither simple nor linear.



Choosing Good Tasks 

Answering these questions will help you choose effective assessment 
tasks. 

Does the Task Match Specific Instructional Intentions? 

When trying to assess a single outcome, it is easy to come up with task 
ideas. For example, if you want students to communicate effectively in 
writing, it seems obvious that you should require them to write. But what 
should they write? If you have not already set specific instructional 
goals— for instance, the specific kinds of writing you want students to 
do: narrative, expository, and persuasive — now is the time. Similarly, if 
you want students to be able to apply the scientific method, having them 
complete experiments or conduct focused studies seems natural, but 
you'll also need to decide what specific content and skills the task should
involve. What kind of experiments? What kinds of studies: A study of
the composition of compost? A community needs survey? A school 
study of dietary habits? It is important that the assessment task match 
the specific instructional outcome it is designed to measure. 

Does the Task Adequately Represent the Content and Skills 
You Expect Students to Attain? 

According to modern learning theory, content and process are inextrica- 
bly linked. For example, social studies thinking differs from mathemati- 
cal thinking. Being able to summarize biology content in writing draws 
on different knowledge and skills than being able to compose a summary 
of literature. Therefore, beyond specifying the general nature of the task, 
you need to think about the specific topics or subject areas you will ask 
students to address. For example, if you want students to write persua- 
sive pieces, what will provide the basis for their writing? Will it be a 
hypothetical problem, a school problem, a personal dilemma, a current 
event, a local issue, a mathematical solution, or an ethical problem?
What range of content do you expect them to use— prior knowledge, 
additional research, or personal knowledge?

Suppose you want students to be able to do science experiments, in
particular, chemistry experiments, for problem solving. In deciding on 
an assessment task, you'll need to consider additional specific content 
issues. What types of substances should they be able to deal with? What 
types of problems — analysis, design, or evaluation? And what types of 
chemical properties and reactions do you want them to incorporate? 
What types of equipment should they know how to use? In short, what's 
the range of content, concepts, principles, and techniques with which 
students should be familiar, and based on these, what's a good example 
of what you expect of students? Do you want them to analyze unknown 
substances with particular properties, predict which of several products 
will work best for a particular purpose, or determine which crop is most 
cost-efficient for reducing hunger? 



Does the Task Enable Students To Demonstrate Their 
Progress and Capabilities? 

Thinking through the specific content you expect in student
performance raises several interrelated issues about task fairness and
potential bias. What does the task assume about the students' prior
knowledge? Have your students had the opportunity to acquire this
knowledge? Does the task include skills that are irrelevant to your
intended assessment goal? In other words, is the task a fair assessment of
what students know and can do, and will students be able to show their
talents and capabilities? To take another example from writing, we know
that students need
background knowledge on the topics about which they're expected to 
write. Without this knowledge, they have nothing to say. Your estimate 
of students' writing skills is always embedded in what students know 
(or don't know) about the designated topic. As you formulate specific 
topics for students, pay attention to the interrelationship between con- 
tent and skill. Don't hinder students' abilities to demonstrate their skills 
by throwing something into the assessment that may be irrelevant to your 
aims. For instance, if your students are not well versed in current events, 
don't expect them to write an eloquent piece taking a position on a 
current national issue. Or if your students are not good readers, don't 
hinder their ability to show their writing skills by having them write 
about an article you've given them from The New York Times. Of course,
you have reading goals for students and you may well want them to
acquire knowledge about current events, but don't unintentionally
confound their ability to show specific skills, or mislabel them as
unskilled, based on an inappropriate task or inadequate opportunity to
acquire
necessary prior knowledge and skills. 

One solution to the prior knowledge dilemma is to provide students 
access to relevant resources, which they know how to use, as part of the 
assessment situation. For example, high school chemistry students in 
Connecticut must design and conduct experiments to determine which 
of two unknown substances is a diet drink and which is a sugared drink. 
The task assesses different things depending on which textbooks and 
other resources students are allowed to use. If teachers limit such re- 
sources, students' performance will depend on whether they remember 
specific tests for and the chemical composition of sugars. Students who 
do not readily recall these facts will not get far in setting up or completing 
appropriate tests. On the other hand, if teachers allow students access 
to resources, the task more directly assesses whether students know how 
to design and conduct scientific experiments, assuming, of course, that 
their textbooks do not contain the solution for the problem. Which is the 
better approach? The answer depends on the teacher's intentions and 
expectations. 

Another solution to the prior knowledge dilemma is to provide 
students a range of options in your assessment task, for instance, by 
giving them their choice of expression— written, oral, visual, or musical, 
and a range of tasks of varying difficulty. 



Does the Assessment Use Authentic, Real-world Tasks? 

Modern curriculum theorists emphasize the importance of engaging 
students in authentic, real-world tasks because they seem more
motivating and have greater transferability than more traditional,
decontextualized academic tasks. These theorists also believe that
engaging students
in the process of a discipline as they acquire or demonstrate knowledge 
in that discipline is a powerful learning strategy. The Connecticut 
chemistry task, for instance, engages students as scientists and asks them 
to answer a question they are familiar with in their real world. 

Similarly, the Content Assessment Prototype in history, developed by 
Eva Baker and colleagues (1992) at CRESST, engages students in
authentic tasks of historians. Students are asked to read primary source
materials, such as an abridged version of the Lincoln-Douglas debates.
They must then draw on their prior knowledge and understanding to explain
the historical issues addressed by these documents, and incorporate the
historical content (the problems and issues facing the nation prior to
the Civil War) in their answer. To provide an authentic purpose for the
task, the assessment protocol also establishes an appropriate audience
for students' responses. 

Real-world problems, realistic techniques, and authentic audiences 
raise innumerable possibilities for tasks. Social studies teachers might 
have students choose a current problem to research and then write a
letter to Congress or the city council, or design a public service adver- 
tisement to advocate a solution. Science teachers might have students 
write letters to their newspapers or state senators, or create a video about 
ecological problems. Math teachers might have students conduct a 
survey of community needs and write a report or figure out how much 
money they'll need to support their future goals, such as buying a car,
considering the purchase price, loan/interest costs, insurance, taxes, 
license, maintenance, gas, and so forth. 



Does the Task Lend Itself to an Interdisciplinary Approach? 

Authentic, real-world problems don't always conform neatly to separate
curriculum domains. Instead, students have to engage knowledge from
a variety of disciplines and perspectives. The "letter to the editor about
solving an ecological problem" draws on students' communication
skills, their science skills in understanding specific ecological problems, 
and their interpersonal skills in understanding their audience. In an- 
other example, a research project task might require a student to research 
a topic, to design an empirical study based on the scientific facts and 
principles they research, to use math skills to analyze and display the 
data from their study, and to apply both their science and communica- 
tion skills to summarize results and report them to others. 

Interdisciplinary tasks offer additional advantages in time and meas- 
urement efficiencies. In reality, meaningful performance tasks often take 
extended periods of time, and there simply may not be enough time to 
assess all content areas separately. Interdisciplinary tasks help teachers 
avoid this potential problem. 



Can the Task Be Structured To Provide Measures of 
Several Goals? 

It's easy to see that interdisciplinary tasks can be assessed from the 
perspectives of the separate disciplines involved. For example, because 
the letter to the editor requires writing skills, interpersonal skills, and
science understanding, you can score it separately for performance in
each of these areas.

Most assessment tasks designed to measure meaningful goals will 
also incorporate a range of cognitive, metacognitive, affective, and social 
skills. For example, the chemistry "soda" task, which takes place over 
several days, includes these components: group work, individual work,
an oral report, and self- and group reflection. In small groups, students 
must first brainstorm a list of tests that will enable them to determine 
which of two samples of soda is sugared and which is diet. They then 
conduct two tests, analyze their results, and present an oral report to the 
class. Each student is also asked to solve another parallel chemical 
analysis problem. Students reflect on the strengths and weaknesses of 
their performances as individuals and as group members, on the per- 
formance of other members of the group, and their attitudes toward the 
task. 

Structuring "mega-tasks" to assess a variety of outcomes requires 
ingenuity. If your high-priority goals include both group and individual 
work, you might have students work as a group to solve a problem, but 
work individually during one or more stages of the project, by having 
each student individually collect and summarize information for a group 
project. Alternatively, you might want students to work in groups to
define and solve a particular problem but have each student present a
report of the group's findings. If you want to assess the extent to which
students accept challenges and try to solve problems despite the effort
and difficulties involved, you would need to leave enough challenge and
choice in the assessment task that students can exhibit more or less
enthusiasm, effort, and persistence, and include ways for you to observe
behavior and affect. Chapter 5 discusses the criteria with which you can 
judge behavior and affect. 

Be aware that while there are advantages and efficiencies in designing 
such multidimensional, complex, and rich assessment tasks, there are 
also disadvantages. Chief among these is teasing out from students' 
responses what is attributable to the skill they've acquired, what is prior 
knowledge, and even what each student's individual achievement level 
is. For example, students with limited writing skills will be inhibited 
from adequately demonstrating their actual level of understanding 
through writing. Students who are not highly motivated in the face of 
challenge may quit a long task before they are able to show their level of 
competence. And if students participate in group work for part of the 
task, it may be more difficult to judge each individual's achievement.



Generating Good Ideas for Tasks

Brainstorming with colleagues is a good strategy for developing initial
ideas for good assessment tasks. You can begin by thinking about the
more complex and successful instructional projects you and your
colleagues have assigned in the past. Remember the first rule of
brainstorming: be creative, write down everything that comes to mind, and
don't
criticize any ideas until they are all out on the table. Then combine, 
refine, and embellish the best aspects of each. 

Beyond your own ideas, capitalize on the efforts of others. You can 
adapt and enhance ideas gathered from professional journals,
conferences and training sessions, observations of other teachers'
classes, and so forth. Be aware that a number of states, school districts,
and schools are working to develop these new kinds of assessments. If
your state has its own assessment, it may be an idea source. CRESST is
assembling a
database of efforts across the country and will be distributing them 
through ERIC. The database will include samples of performance assess- 
ments in a variety of subject areas and at a variety of grade levels. While 
none of these samples will suit your needs and purposes totally, you can 
borrow the assessment ideas they represent, the scales they use for
scoring student performance, and so forth. Even if you are not a
chemistry teacher, you may be intrigued by Connecticut's approach to
rating group process and may adapt their scales for your own group work.
The Lincoln-Douglas/Civil War assessment of understanding and
explanation described earlier may provide you with a similar approach in
assessing students' social studies, science, or art understanding. 

If the assessments you are developing are part of a schoolwide effort, 
consider involving others in the school community: parents, business
representatives, and community members. Those outside the school may 
be particularly helpful in generating real-world, authentic tasks that 
exemplify important thinking, problem-solving, and communication 
skills for students. They can be helpful also as "task reviewers" and in 
alerting you to the kinds of knowledge, relevant and irrelevant, these 
tasks represent.



Describing Your Assessment Task 

Formal assessment tasks need to be carefully specified or documented 
so that others can interpret the results or can repeat your methods with 
other students in other settings. Perhaps even more important, because 
assessments are supposed to represent how a student performs in a larger
domain, it is important that you know what that broader domain is. A
task description helps to describe the larger domain, provides a blueprint
for other specific assessments that might be drawn from it, and allows
you to review your work and catch major problems before you try them 
out on students. 

While the nature of the assessment task will dictate what needs to be 
specified, the following aspects usually need specification: 

■ What outcome(s) are intended for the assessment? 

■ What are the eligible content/topics? 

■ What is the nature and format of questions to be posed to stu- 
dents? What is the audience for the response? 

■ Is it group or individual work? If group work, what roles are to be 
filled? 

■ What options/choices are allowed? What are the choices in
response mode? What will they include (for example, portfolios)?
Who makes the choices — the teacher or students or both? 

■ What materials/equipment/resources will be available to stu- 
dents? Are there any specifications? 

■ What directions will be given to students? 

■ What administrative constraints are there? How much time is 
allowed? What is the order of tasks? How will student questions 
be answered? What help will be allowed? 

■ What scoring scheme and procedures will be used? 

Figure 4.2 on page 42 provides a sample template for your task descrip- 
tion. The checklist summarizes both the major concerns associated with 
creating assessment tasks and the scoring issues to be addressed, which 
are discussed in Chapter 5.
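
If you keep your task descriptions electronically, the aspects listed above can be recorded as a small structured record. The following is only a sketch under stated assumptions: the class name `TaskDescription`, its field names, and the `missing_items` helper are hypothetical labels chosen to mirror the list above, and the example values are illustrative rather than taken from an actual assessment protocol.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TaskDescription:
    """Hypothetical record of one assessment task (field names are illustrative)."""
    outcomes: List[str]            # outcome(s) intended for the assessment
    eligible_content: List[str]    # eligible content/topics
    prompt_format: str             # nature and format of the question posed
    audience: str                  # audience for the response
    group_work: bool               # True for group work, False for individual work
    group_roles: List[str] = field(default_factory=list)
    response_options: List[str] = field(default_factory=list)
    materials: List[str] = field(default_factory=list)
    student_directions: str = ""
    time_allowed: Optional[str] = None
    help_allowed: str = ""
    scoring_scheme: str = ""

    def missing_items(self) -> List[str]:
        """List the aspects that have not been specified yet."""
        missing = []
        for name, value in vars(self).items():
            if name == "group_roles" and not self.group_work:
                continue  # roles only matter when the task involves group work
            if value is None or value == "" or value == []:
                missing.append(name)
        return missing


# Illustrative values only, loosely modeled on a chemistry task.
draft = TaskDescription(
    outcomes=["design and conduct scientific experiments"],
    eligible_content=["distinguishing a sugared drink from a diet drink"],
    prompt_format="hands-on experiment followed by an oral report",
    audience="classmates",
    group_work=True,
)
print(draft.missing_items())
```

Running the helper on a draft shows at a glance which parts of the description still need to be written before the task is piloted.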

Ensuring That Your Tasks Lead 
to Sound Assessments 

Given the complexity of task development, you will want to review your 
tasks prior to piloting them with students. These criteria can help you 
critique your assessment ideas before developing them fully: 

■ Do the tasks match the important outcome goals you have set for 
students? Do these goals reflect complex thinking skills, such as
analysis and synthesis?

■ Do they pose an enduring problem type, that is, the types of problems
and situations that students are likely to face repeatedly in school 
and their future lives? 

■ Are the tasks fair and free of bias? For example, do they favor
either boys or girls, students who have lived in a particular
location or region, students with a particular cultural heritage, or
those whose parents can afford to buy certain materials?

■ Will the tasks be credible to important constituencies? Will they
be seen as meaningful and challenging by students, parents, and
teachers? Do the tasks rely on quality subject matter content?

■ Will the tasks be meaningful and engaging to students so that they 
will be motivated to show their capabilities? Do the tasks involve
real problems, situations, and audiences? 



■ Are the tasks instructionally related/teachable? Do they represent
skills and knowledge that your students can acquire and that
you have the materials and expertise to adequately teach?

■ Are the tasks feasible for implementation in your classroom or
school in terms of space, equipment, time, costs, and so forth?
Are they feasible for students to accomplish in terms of
outside-of-school requirements, including family and other demands on
students' time, access to libraries and other resources, and
affordability?

These criteria are derived from the more general CRESST criteria for
judging assessment quality (Linn et al. 1991). Considering them helps
ensure that assessments yield valid inferences about students and
programs.



Figure 4.2

Checklist for Your Task Description

Outcomes to Be Measured

• Description of instructional goals
• Eligible content/Topics
• Rules/Process for selection

Assessment Administration Process

• Group/Individual roles
• Materials/Equipment
• Administration instructions
• Help allowed
• Time allowed

Actual Question/Problem/Prompt

• Format
• Audience
• Options available
• Student directions

Scoring

• Rubric/Criteria
• Scoring procedures
• Use of scores



Setting Criteria



The criteria used for judging student performance lie at the heart of
alternative assessment. Although we have discussed selecting and
describing assessment tasks separately from developing scoring criteria,
these three aspects of assessment are intimately intertwined. In the
absence of criteria, assessment tasks remain just that, tasks or
instructional activities. Perhaps most important, scoring criteria make
public what is being judged and, in many cases, the standards for
acceptable performance. Thus, criteria communicate your goals and
achievement
standards. 

Like "alternative assessment" itself, criteria for judging student per- 
formance have been called many things, including scoring criteria, 
scoring guidelines, rubrics, and scoring rubrics. For our purposes, we 
take all these terms to mean a description of the dimensions for judging 
student performance, a scale of values for rating those dimensions, and,
when appropriate, the standards for judging performance. 

Let's take a common example from social studies. You assign students
a group presentation accompanied by individual written reports to
assess their understanding of history. Because you wish to assess three
skills (oral, written, and group process skills as they relate to history),
you must consider scoring criteria for each skill. Figure 5.1 on pages
46-47 is a possible set of scoring criteria for just one of those skills, a
history group process assessment developed by the California
Assessment Program.

The group process exercise taps four learning outcomes: group learning,
critical thinking, communication, and history knowledge. For each
outcome, scoring dimensions are specified and levels of performance
differentiated by a scoring scale. Finally, the scoring guide includes an
evaluation of each performance level, labeling performance not only in
terms of what was accomplished but how well, from minimal to
exceptional achievement.



Understanding the Need for Criteria 

Criteria are necessary because they help you judge complex human
performance in a reliable, fair, and valid manner. Scoring criteria guide
your judgments and make public to students, parents, and others the
basis for these judgments. Scoring a multiple-choice test does not require
complicated judgment; nevertheless, human judgment is still a factor
because the test developer phrases the questions and decides what
constitutes the best answers. To the person who scores the test, a student
either has or has not selected the correct answer; no judgment is needed.
When we use selected-response tests, we are essentially corroborating
the judgments about adequate performance built into the "answer key." 
Thus, all assessment, be it selected- or constructed-response, has a 
subjective or human judgment component. 

Alternative assessments invite a wider range of possible responses. 
Instead of judging responses as right or wrong, alternative assessments 
judge the quality of, and sometimes the process of, arriving at a complex
response. To make such judgments and to ensure their validity,
consistency, and fairness, we need criteria or scoring guidelines. Scoring
criteria must be well-conceived, explicitly defined, and consistently 
applied. Well-specified criteria help to ensure that everyone under- 
stands what is expected. 

Well-articulated and publicly visible criteria for judging student
responses are necessary and useful whether the results will be used in
the classroom or to make school-level or national decisions. In all
assessment settings, scoring criteria must: 

■ Help teachers define excellence and plan how to help students
achieve it.

■ Communicate to students what constitutes excellence and how
to evaluate their own work. 

■ Communicate goals and results to parents and others. 

■ Help teachers or other raters be accurate, unbiased, and consis- 
tent in scoring. 

■ Document the procedures used in making important judgments 
about students. 

Criteria and Instructional Planning 

Scoring criteria clarify instructional goals. Along with the task
description, the criteria define priority outcomes in terms of the content
to be
covered, the knowledge or skills to be demonstrated, and the context in 
which these are to occur. The complete alternative assessment specifi- 
cations can guide selection and sequencing of relevant instructional 
activities. 



Criteria and Students 

The criteria for alternative assessments are often made public and are 
intended to be discussed with students. Public discussions help students 
to internalize the standards and "rules" they need to become inde- 
pendent learners. Alternative assessments and their criteria can be 
woven into the fabric of the curriculum so that they are transparent to 
the student and perceived as a natural part of the learning process. Such 
assessment is ongoing and takes many forms — journals, conferences, 
peer or teacher coaching episodes, critiques of products and exhibitions,
and formal evaluations of individual works or a body of work. Examples 
of what constitutes good work engage students in the work itself and in 
judgments about their work. Public discussions of quality and criteria 
inform students during the formative period of instruction, not simply 
at the end of a unit or course when it is too late to make improvements.
Furthermore, discussions of criteria also help students see the
perspectives of their teachers, their peers, and sometimes even the
experts in the field.



Criteria and Parent Involvement

Clearly articulated criteria also communicate to parents and others what 
the teachers and schools are trying to accomplish. Criteria operationalize 
learning goals and expectations for children. When parents know prior 
to grading what is expected, they can support their child's learning. For 
example, giving parents of kindergartners a copy of "Profile of
Developmental Outcomes for Kindergarten" (Figure 5.2) allows them to work
with their children at home on activities such as recognizing beginning 
letters or sight words. The road to literacy is well-marked; teachers who 
share the map with parents may find that more of their students reach 
their destinations in a timely manner. 

Good criteria help both students and parents share some of the 
responsibility for learning. Parents and children who are familiar with 
the standards by which work is judged are less likely to ascribe poor 
performance to such external factors as not being told what was impor- 
tant or personality conflicts between teachers and students. 



Criteria and Consistency 

When guidelines for what constitutes good work are vague or unstated, 
it is difficult to be consistent, fair, and accurate in judging student 
responses. With selected-response tests, accuracy and consistency in
scoring refer to whether the test score for an individual pupil remains
fairly stable from one testing occasion to another, in the absence of 
intervening instruction or growth. This consistency is better known as 
reliability. For alternative assessments, reliability includes not only the 
idea of the stability of an individual student's performance over time but 
also the stability of a rater's judgments of that performance. Specifically, 
a reliable assessment that depends on human judgment must meet the 
following requirements: 

■ Several judges looking at a specific task would come to the same 
conclusion about a student. 

■ Each judge would rate the student's performance on a specific 
task about the same on a subsequent occasion. 

■ The student would perform the same task at about the same level 
on different occasions. 

■ If the task is meant to represent or generalize to some larger 
domain, the sample is representative of that domain.

It is easy to see how these four requirements for reliable scoring demand
a mechanism for creating rater agreement and for delineating clearly the 
domains of particular assessment tasks. Scoring criteria must meet this 
demand. 
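
The first requirement, agreement among judges, can be checked with simple arithmetic once two raters have scored the same set of responses. The sketch below is one possible check, not a procedure prescribed here: the function names, the one-point tolerance, and the sample scores are all assumptions made for illustration.

```python
def exact_agreement(ratings_a, ratings_b):
    """Fraction of responses that both raters gave exactly the same score."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty lists of ratings")
    matches = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b)
    return matches / len(ratings_a)


def adjacent_agreement(ratings_a, ratings_b, tolerance=1):
    """Fraction of responses whose two scores differ by at most `tolerance`."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty lists of ratings")
    close = sum(1 for a, b in zip(ratings_a, ratings_b) if abs(a - b) <= tolerance)
    return close / len(ratings_a)


# Hypothetical scores from two raters on the same five student responses.
rater_1 = [4, 5, 3, 6, 2]
rater_2 = [4, 4, 3, 6, 1]

print(exact_agreement(rater_1, rater_2))     # 3 of 5 scores match exactly
print(adjacent_agreement(rater_1, rater_2))  # all 5 pairs are within one point
```

Low agreement on a figure like this usually signals that the criteria need sharper definitions or that raters need to discuss sample responses together.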



Criteria and Consequences 

Specifying criteria is always important and becomes even more so when 
the consequences of an assessment are very serious, such as when results
are used for retention, graduation, or placement in special programs.
Clear guidelines for evaluating student work ensure appropriate
consequences for students and the educational system as a whole.
Furthermore, when alternative assessments are used for these high-stakes
decisions, the scoring procedures and criteria must be legally defensible 
and adhere to the due process standards of a court of law. 



Specifying Criteria 

Different testing purposes require different kinds of scoring criteria.
Many of the examples in this book were developed for state-level 
assessments with such high-stakes testing purposes as comparing 
schools, identifying low-performing schools, and evaluating individual 
schools. The California Assessment Program (CAP) history group
process criteria (shown in Figure 5.1) are an example of the complex
criteria used in high-stakes assessment. Because the criteria are used for
a one-shot state assessment, the scoring guide was developed to extract
the maximum amount of information possible during limited assessment
time. We see that the criteria:

■ List multiple learning outcomes. 

■ Divide each outcome into performance levels. 

■ Describe traits/characteristics for each level.

■ Provide a numerical scale to rate the degree to which each level 
was attained. 

■ Evaluate the quality of student performance represented by the
different levels using such descriptors as "minimal achievement"
or "excellent achievement."

Your criteria will be less complex when your testing purposes are more
focused and the decisions you wish to make about students are limited.
If you are using student academic journals to monitor their progress in
making connections between science lessons and their daily lives, your 
scoring criteria may be to count the number of unprompted statements 
connecting classroom learning with out-of-class experiences. The num- 
ber of connections you find will tell you whether you are achieving your 
goals. Your assessment purpose here may be formative — to improve your 
instruction and to identify students who need more help or a different 
approach. 

Perhaps your assessment purpose is more traditional — you want to 
evaluate student progress toward meeting your goals in mathematics 
problem solving. Your scoring criteria might resemble the generalized 
rubric for essay-type mathematics problems developed by the CAP
(shown in Figure 5.3). The criteria provide descriptions of each level of
performance in terms of what students are able to do, assign values to
these levels, then apply standards at certain cut points. Students rated
1-2 are evaluated as having "inadequate" responses; students rated 3-4
receive a "satisfactory"; and students receiving 5-6 are rated
"competent."
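
Applying standards at cut points amounts to a simple lookup from a rating to an evaluation. As a minimal sketch, assuming the six-point scale and the three labels described above (the function name `label_for_rating` is hypothetical):

```python
def label_for_rating(rating):
    """Map a rating on the six-point scale to its evaluation label."""
    if rating not in range(1, 7):
        raise ValueError("rating must be a whole number from 1 to 6")
    if rating <= 2:
        return "inadequate"    # ratings 1-2
    if rating <= 4:
        return "satisfactory"  # ratings 3-4
    return "competent"         # ratings 5-6


for rating in range(1, 7):
    print(rating, label_for_rating(rating))
```

Writing the cut points out this explicitly makes it easy to see, and to discuss with colleagues, exactly where one evaluation ends and the next begins.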

While grading is a complex issue and the scores of any one alternative 
assessment may or may not be used to assign grades, it is possible to find 
or develop criteria linked specifically to letter grades. Researchers 
funded by the National Science Foundation have developed a
grade-linked set of criteria to assess students' procedural knowledge in a
hands-on science experiment (Baxter et al. 1992). The researchers
determined which methods students could use to solve the problem posed by
the experiment, judged which would produce the most logical and
efficient solutions, then created grade-referenced criteria to reflect their
evaluations of the solutions. A summary of how their criteria are linked
to grades appears in Figure 5.4. 

Regardless of the testing purpose, the sample criteria have four 
common elements. Each has 

■ One or more traits or dimensions that serve as the basis for judging
the student response

■ Definitions and examples to clarify the meaning of each trait or
dimension

■ A scale of values (or a counting system) on which to rate each
dimension

■ Standards of excellence for specified performance levels accom-
panied by models or examples of each level.

Figure 5.3

CAP Generalized Rubric 

(California State Department of Education 1989)



Demonstrated Competence
Exemplary Response . . . Rating = 6

Gives a complete response with a clear, coherent, unambiguous, and elegant
explanation; includes a clear and simplified diagram; communicates effec-
tively to the identified audience; shows understanding of the open-ended
problem's mathematical ideas and processes; identifies all the important ele-
ments of the problem; may include examples and counterexamples; presents
strong supporting arguments.

Competent Response . . . Rating = 5

Gives a fairly complete response with reasonably clear explanations; may
include an appropriate diagram; communicates effectively to the identified
audience; shows understanding of the problem's mathematical ideas and
processes; identifies the most important elements of the problems; presents
solid supporting arguments.

Satisfactory Response 
Minor Flaws But Satisfactory . . . Rating = 4 

Completes the problem satisfactorily, but the explanation may be muddled;
argumentation may be incomplete; diagram may be inappropriate or
unclear; understands the underlying mathematical ideas; uses mathematical
ideas effectively.

Serious Flaws But Nearly Satisfactory . . . Rating = 3 

Begins the problem appropriately but may fail to complete or may omit sig-
nificant parts of the problem; may fail to show full understanding of mathe-
matical ideas and processes; may make major computational errors; may
misuse or fail to use mathematical terms; response may reflect an inappropri-
ate strategy for solving the problem.

Inadequate Response

Begins, But Fails to Complete Problem . . . Rating = 2 

Explanation is not understandable; diagram may be unclear; shows no
understanding of the problem situation; may make major computational
errors.

Unable to Begin Effectively . . . Rating = 1 

Words do not reflect the problem; drawings misrepresent the problem situa-
tion; copies parts of the problem but without attempting a solution; fails to
indicate which information is appropriate to problem.

No Attempt . . . Rating = 0

Figure 5.4

Linking Criteria to Grades 


A 


Criteria for Determining Grades*

Student selects method.

Student saturates towels.

Student determines result so as to answer question. 
Result logically follows from method used to saturate towel.
Measurements are accurate/carefully done. 
Conclusions are correct. 


B 


Meets all requirements of an "A" but measurement is careless.


C


Meets all requirements of "A" but may be deficient in some areas.
Must attempt to control saturation by putting the same amount of
water on each towel.

Towels not saturated (key dimension for determining a "C" or
below grade). 


D 


Student fails to saturate towels or control for saturation.

Result is logically inconsistent with method used to saturate towels.


F


Student did not conduct the investigation
Or, equipment manipulated without purpose
Or, towels not wet

Or, conclusions based on how towels felt.


*Criteria abridged from Baxter et al., p. Si.



Considerations in Selecting Dimensions 

The dimensions you use to assess student performance in a certain 
domain should reflect the essential qualities of good performance in that 
domain. Where do you find these essential qualities? The qualities or 
dimensions can be provided by non-educator experts, colleagues in your
department, grade level teachers, district curriculum committees, re-
search literature, and national, state, or local subject area standards
committees. If you are creating criteria for your own classroom, focus
your criteria on those aspects of student performance that reflect your
highest priority instructional goals and represent teachable and observ-
able aspects of performance.

One way to uncover dimensions for scoring criteria is to ask yourself
the following kinds of questions: 

■ What are the attributes of good writing, of good scientific think- 
ing, of good collaborative group process, of effective oral presen- 
tation? More generally, by what qualities or features will I know 
whether students have produced an excellent response to my 
assessment task? 

■ How does completing this task relate to my goals for students? 
What will they do that shows me we are working towards or 
achieving some of these goals? 

■ What do I expect to see if this task is done excellently, acceptably, 
poorly? 

■ Do I have samples or models of student work, from my class or
other sources, that exemplify some of the criteria I might use in 
judging this task? 

■ What criteria for this or similar tasks exist in my state curriculum 
frameworks, my state assessment program, my district curricu- 
lum guides, my school assessment program? 

■ What dimensions might I adapt from work done by national 
curriculum councils, by other teachers? 

In addition to describing your judgments about performance, the dimen- 
sions you use for your criteria need to be written so that all audiences 
who use them will understand them in the same way. Perhaps you are 
judging an interdisciplinary art project designed to reflect social studies 
understanding of the relationship of Native Americans to their environ- 
ment. Your criteria for assigning grades or judging levels of performance 
should be clear to students, parents, and other teachers who depend on 
your judgments about content mastery, be they others at your grade level
or those teaching your students next year. 

Clear descriptions of performance dimensions can be achieved in 
several ways: 

1. You could write definitions in terms of the behaviors or elements 
you will see when judging students. For example, instead of
saying, "Acceptable performance means students show an under- 
standing of living in harmony with the land," you could say, 
"Acceptable performance means that student drawings depict an 
environment that is almost unchanged from its original state. Few 
trees are cut; grassland is undisturbed except for small sustenance 
patches; no large waste dumps exist, and so on."

2. You could provide models or examples for each dimension. This
is commonly done in direct writing assessments. Teachers are
given copies of student essays exemplifying each point in the score 
distribution. The essays illustrate such dimensions as, "the essay 
is well organized; it begins and ends effectively." From these, 
teachers and others can articulate precise definitions of each 
dimension. 

3. If you are assessing informally, you could clarify your dimensions 
as a set of questions. For example, when you are assessing journals 
to see what kinds of help students need in developing fluency in 
writing, your criteria for deciding what to work on next could
include the following questions: Which students are using some
pre-writing strategies such as clustering, drawing, listing, or free-
writing? Which students are keeping a log of writing ideas? Which
students are having spelling problems that block the flow of ideas? 

Unambiguous scale definitions usually consist of a description of the 
dimension to be rated, plus examples of student work illustrating accept- 
able responses. Those models or work samples are crucial in developing 
a consensus about the meaning of criteria when used for rater training 
in formal assessments. Models also provide students with concrete 
examples of what acceptable or excellent work can look like. Figure 5.5 
details one of several dimensions in a scoring rubric developed by 
CRESST to assess the depth of high school students' understanding of 
history as revealed in their essays. Note that dimensions and scale points 
are thoroughly operationalized: key terms, such as "concept," are de-
fined and examples of basic points, such as statements of opinion, are
provided. 

In most cases, your performance dimensions, particularly for class-
room assessment, will reflect your views of what constitutes excellence
or expertise and will be moderated by your expectations for students at
different grade levels and by your instructional goals at different points
in the school year. Because your criteria help students focus on what's
important instructionally, you may use different criteria at different
times during the school year. For example, while you may feel that 
organization and mechanics are an important part of expressing disci- 
pline-based knowledge in history or science, at the beginning of the year 
you may particularly want to encourage fluency. Thus, your criteria at 
the beginning of the semester will stress the number of ideas presented, 
number of examples or definitions for each idea, and so on. As students 
become more fluent and able to substantiate their views, you can expand 
your criteria to include organization and mechanics. To take an example 
from figure skating, you may believe in the Olympic criteria of "technical

Figure 5.5

CRESST Content Area Explanation 
Essay Scoring Guidelines 

(Baker, Aschbacher, Niemi, and Sato 1992)



CRESST Scoring Rubric Scales: 

General Impression — Content Quality 

Number of Principles or Concepts 

Prior Knowledge: Facts and Events 

Argumentation 

Misconceptions 

Text details 

Example of Guidelines for the Number of Principles or Concepts Scale: 
Number of Principles/Concepts

This is a measure of the number of different social studies concepts or princi-
ples that the student uses with comprehension.

A concept is an abstract, general notion, such as "inflation." It does not 
refer to particular events or objects (such as one particular period of inflation), 
but instead represents features common to a category of events or objects.
"Imperialism," for example, does not refer to any specific facts or events; it is a
heading that characterizes a class of behaviors and beliefs. "Industrialization" 
likewise identifies a class of activities and events that share common proper- 
ties. It must be clear that the student is using a term conceptually, not just as 
a label. 

A principle is a rule or belief used to justify an action or judgment, as in the 
statement "Slavery is immoral," where "morality" serves as a justifying principle. 

It should be evident that the student understands the concept and means 
to discuss it. The concept should not simply be mentioned within a quotation
from the text with no indication that the student grasps the concept. To earn a
score point, the concept or principle need not be named explicitly, such as,
"Constitutionality was an important principle that influenced the debate over
slavery," but the idea should be stated clearly, for example, "One problem
was determining what the constitution said about slavery."

Score point guidelines:
0 - no response
1 - no concepts/principles
2 - one concept/principle
3 - two concepts/principles
4 - three concepts/principles
5 - four or more concepts/principles

Example: "One great factor that held us back from war was our econo-
my. It was not known what would happen to our economy without the safety
of Britain. Britain could defend our commerce and coasts. Also, with Britain
there was a great advantage with exportation. It seemed our economy could
only suffer without the aid of Britain."

merit" and "artistic expression" but at different points in your teaching
you may want to differentially emphasize one or the other. 

Dimensions for Complex Tasks 

As we mention in Chapter 4, it is entirely possible to create a complex 
assessment with multiple intended outcomes. Multiple outcomes re- 
quire multiple criteria, a set for each outcome. Multidimensional criteria 
are unavoidable when you are doing interdisciplinary assessment or 
judging complex learning goals. You may either formulate separate 
criteria for each of these outcomes or create a multidimensional set of 
criteria. Connecticut's state assessment in science incorporates two 
approaches to assessing the same task by providing criteria for assessing 
group process and individual accomplishment (see Figures 5.6 and 5.7).
Another perspective on student performance is provided by the subskills 
within the individual and group assessments. When examining group 
process skills, we are interested in scientific process, communication, 
and group collaboration. Separate criteria attend to each of these skills. 
The multiple dimensions on the individual scale include content and 
communication outcomes. 

The dimensions for each scale require a lot of inference. Both teachers 
and students would need further descriptions of such dimensions as
"draw reasonable conclusions" or "collaborate effectively" in order to
use the scales. In fact, these scales are used in classrooms only after 
teachers have had inservice training to discuss the meaning of the 
dimensions, review examples, and practice using the criteria. Through 
classroom discussion and examples students and teachers come to a 
mutual understanding of the dimensions of the individual scale. 

A less complex example of multidimensional criteria appears in 
Figure 5.1. The criteria assess four group performance outcomes: col-
laboration, critical thinking, communication, and history knowledge.
The criteria include sub-criteria for deciding at which of five perform- 
ance levels we should place students for each outcome. The entire set of 
group process criteria may be viewed as a compendium of four sets of 
criteria, one for collaboration, one for critical thinking, one for commu- 
nication, and one for history knowledge. 

Using Rating Scales 

All sample scoring criteria included in this chapter contain some type
of scale, either numerical, qualitative, or both. The criteria in Figure 5.1,
the history group process, and Figure 5.3, the mathematics problem,
contain both numerical and qualitative rating scales. Figure 5.4, the 
hands-on science criteria, and Figures 5.6 and 5.7, the group and indi- 
vidual science experiment, have qualitative ratings only, such as letter 
grades or evaluations such as "excellent" or "needs improvement." 

Why scales? How do you know whether to use numerical or qualita- 
tive ratings? What about using a checklist instead of a rating scale? 
Whether you rate the presence or absence of a performance, as in a 
checklist, or use numbers or qualitative evaluations will depend on your 
testing purpose. There are three major types of scales: checklists, nu- 
merical ratings, and qualitative (either descriptive or evaluative) ratings. 
If your purpose is to describe what students can do, perhaps for parent
conferences or to compare student performance to certain developmen-
tal standards, you may be able to use the simplest rating scale of all, the
checklist. If you need more information than simply whether or not a
student is engaged in specific aspects of a task, you will need a more
fully developed rating scale. When you want to know the extent to which
dimensions were observed or the quality of the performance, you need 
more elaborate scales. Rating scales, beyond the yes-no checklist format, 
reflect aspects of student performance other than mere accomplishment 
of an activity. 



Checklists 

A checklist is a list of dimensions, characteristics, or behaviors that are 
essentially scored as "yes-no" ratings. A check indicates whether the
characteristic or behavior was present or absent. Checklists often contain
more dimensions to be scored than do rating scales, but those dimen- 
sions are often quite narrow and concrete. 

Checklists can be useful in assessing processes, an important purpose 
for teachers concerned with the how as well as the what of learning. A 
process checklist for a hands-on experiment could resemble Figure 5.8, 
which asks the rater to note the presence of specified behaviors. 

Primary school teachers find checklists useful because they must 
often determine how students are developing according to some theory 
of skills acquisition. For example, current language acquisition theory 
suggests that this skill cluster supports a child's ability to read: 

■ Ability to draw or depict an idea 

■ Ability to recognize sound-letter correspondence 

■ Ability to recognize that words stand for something 

■ Knowledge of left to right and up-to-down page orientation

■ Ability to recall and retell favorite stories 



Figure 5.8 
Process Checklist 


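Procedure                      Check if Observed    Comments

Selected approach              ____                 ____________
Correct equipment used         ____                 ____________
Measurement accurate           ____                 ____________
Sought peer help if needed     ____                 ____________
Recorded observations          ____                 ____________
Cleaned up after experiment    ____                 ____________
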
The teacher can document acquisition of these readiness skills with a 
checklist. There is no need to judge how well each of these behaviors is
displayed, only that they are in place. Figure 5.2 demonstrates a devel-
opmentally-based profile for kindergartners created by teachers of the
Soledad Union School District in California, with consultation from
Pacific Oaks College in Pasadena, California. This is an example of a 
theory-based profile. The profile development process was designed to 
help staff better understand constructivism, the developmental learning 
theory on which it is based. The behaviors identified in Figure 5.2 are 
sequenced from left to right in the order that the kindergarten staff 
predicted that those behaviors are acquired. This document was de- 
signed to be re-analyzed each year as teachers observe children's behav- 
iors from a developmental point of view. 
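
A yes-no checklist of this kind is simple to record. The following Python sketch is illustrative only: it uses the reading readiness skill cluster listed earlier, and the names `READINESS_SKILLS` and `skills_not_yet_observed` are hypothetical.

```python
# Skill cluster taken from the reading readiness list above (wording shortened).
READINESS_SKILLS = [
    "draws or depicts an idea",
    "recognizes sound-letter correspondence",
    "recognizes that words stand for something",
    "knows left-to-right and up-to-down page orientation",
    "recalls and retells favorite stories",
]


def skills_not_yet_observed(observed):
    """Return the checklist skills that have not been checked off.

    `observed` is the set of skills the teacher has seen the child display.
    No judgment of quality is recorded, only presence or absence.
    """
    return [skill for skill in READINESS_SKILLS if skill not in observed]
```

Because the checklist records only whether a behavior was seen, the result is a list of skills to watch for, not a score.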



Numerical Scales 

A numerical scale uses numbers or assigns points to a continuum of 
performance levels. The length of the continuum or the number of scale 
points can vary, three points, four points, five points, seven points — any 
number is possible. How many divisions or scale points should a good
scale include? While there's no single answer to this question, our
experience suggests that you consider these issues.

The number of points or divisions on a scale can and should vary 
depending on what decisions you will be making about students and 
whether the scale will be used in the classroom or in a formal scoring 
session with several raters involved in judging performance. In general, 
the larger the scale, the more difficult it is to clearly differentiate among 
the score points. Consider how quickly you can sort essays into stacks 
worthy of zero points, one point, or two points; essentially a decision 
among low, medium, and high. Why use a ten-point scale if you really 
only want to distinguish two or three groups of students, such as those 
who need additional instruction on writing a well-organized essay and 
those who don't? 

A scale with only a few points does have some disadvantages. More
scale points enable you to identify small differences between individual 
students and may provide more diagnostic information than a reduced 
scale. For example, a longer scale may be needed if you want to use one 
scale for all students K-12 and you also want to differentiate among 
students in a single grade. Also, if your scale will be used for formal 
assessment purposes where several readers will be rating each perform-
ance, any statistics you have to calculate, such as rater agreement, will
be affected by the scale range. Using a shorter scale will result in a high 
percent agreement, but it will be more difficult to achieve a high 
correlation between raters' scores (two different ways of figuring inter- 
rater reliability). 
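
The two ways of figuring interrater reliability mentioned above can be computed as follows. This is a minimal Python sketch, not a prescribed procedure: the function names are hypothetical, percent agreement is taken to mean exact agreement, and the correlation is the ordinary Pearson correlation.

```python
from math import sqrt


def percent_agreement(rater_a, rater_b):
    """Share of performances on which both raters gave exactly the same score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty score lists")
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return matches / len(rater_a)


def pearson_correlation(rater_a, rater_b):
    """Pearson correlation between two raters' scores for the same performances."""
    if len(rater_a) != len(rater_b) or len(rater_a) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    n = len(rater_a)
    mean_a = sum(rater_a) / n
    mean_b = sum(rater_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(rater_a, rater_b))
    var_a = sum((a - mean_a) ** 2 for a in rater_a)
    var_b = sum((b - mean_b) ** 2 for b in rater_b)
    if var_a == 0 or var_b == 0:
        # Undefined when one rater gives every performance the same score.
        raise ValueError("correlation is undefined when a rater's scores do not vary")
    return cov / sqrt(var_a * var_b)
```

Running both functions on the same pair of score lists shows how the two statistics respond differently to the length of the scale.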

It takes longer to arrive at consensus about how to assign scale points 
when there are more points to consider. With a five- or six-point scale, 
raters often refer to prior experience and assign the lowest points to 
off-task or truly terrible performances, the highest to stellar examples, 
reserve the middle for "passing," "acceptable," or model performances,
then allocate those not fitting into the three anchor points to the remain- 
ing scale values. An eleven- or seventeen-point scale makes it more 
difficult for raters to anchor their judgments in prior experience. How- 
ever, you will often see scales in multiples of five, such as ten, fifteen, 
or twenty point scales, which allow readers to "chunk" the points into 
five-point intervals. Initial rating distinctions are then really made 
between a five and a ten rather than a four and a seven with examples 
not clearly fitting into the increments receiving the intermediate points.

Another consideration related to scale size concerns multidimen- 
sional criteria. If you are rating the same performance with several 
criteria, each assessing a different outcome, you may want to use the 
same number of scale points for each outcome. Not only does this make
it possible to aggregate or compare the results of several scales, but it
eases the rating task. For example, using a four-point scale for coherence
and a five-point scale for supporting facts could slow the rating process 
while raters mentally shift to different scale points. Students trying to 
understand their relative strengths and weaknesses can also have diffi- 
culty comparing different scales. However, if you want some outcomes 
to count more than others for a total score, you can use different size 
scales to reflect relative value or weight. A good example of this strategy 
appears in Figure 5.1, the history group process task. The scoring guide
uses two different scales with one set of outcomes "weighted" up to 
twenty points and the other up to thirty. 
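
Weighting by scale size can be sketched as follows. This Python example is illustrative only: the function name is hypothetical, and which outcomes carry twenty points and which carry thirty in Figure 5.1 is not stated here, so the assignment of maximums to outcomes in any use of this function is an assumption.

```python
def total_score(points_earned, max_points):
    """Add up the points earned on each outcome.

    Each outcome is checked against its own scale ceiling, so an outcome
    with a larger ceiling can contribute more to the total.
    """
    total = 0
    for outcome, earned in points_earned.items():
        ceiling = max_points[outcome]
        if not 0 <= earned <= ceiling:
            raise ValueError(f"{outcome}: score must be between 0 and {ceiling}")
        total += earned
    return total
```

For instance, with hypothetical maximums of 20 for collaboration and 30 for history knowledge, a group earning 15 and 24 points would total 39 of a possible 50.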



Qualitative Scales 

A qualitative scale uses adjectives rather than numbers to characterize 
student performance. These scales are of two general sorts, descriptive 
and evaluative. Descriptive scales label student performance but don't 
necessarily make explicit the standards underlying the judgment; they 
use fairly neutral terms to characterize performance. Judgments about 
task completion, task understanding, or the appearance of certain ele-
ments in the performance are typical descriptors. Figure 5.9 provides
three examples of descriptive scales that do not evaluate the worth of 
student performance. 



Figure 5.9 

Descriptive Scales 



No evidence ... Minimal evidence ... Partial evidence ... Complete evidence.

Task not attempted ... Partial completion ... Completed ... Goes beyond.

Off task ... Attempts to address task ... Minimal attention to task ... Addresses task
but no elaboration ... Fully elaborated and attentive to task and audience.



Evaluative scales incorporate judgments of worth anchored in under-
lying standards of excellence. The most commonly used evaluative
scales are grades (see Figure 5.4). Scales using descriptors of "excel-
lence" (Figures 5.1, 5.6, and 5.7) or judging competence (Figure 5.3) are
evaluative in nature. Evaluative scales require higher levels of inference
to interpret than descriptive scales. The inferences are made by referring 
directly to the scoring criteria. The criteria themselves embed notions of 
excellence, competence, or acceptable outcomes. 



Numerical-Qualitative Scales

Numerical scales are often easier for people to remember, to aggregate,
and to average, but are difficult to interpret in the absence of good 
descriptors. After all, a score of "4" on a six-point scale may connote 
different levels or qualities of attainment to different people. Good 
criteria often include both descriptive and numerical values. For exam- 
ple, Figure 5.3 displays a draft of a scale used by the California Assess- 
ment Program for judging open-ended math problems. Note that it is both 
numeric and descriptive. Performance is rated numerically, but each 
numerical score is attached to an evaluation ranging from "inadequate" 
to "competent." 

Whether your scale values are numerical, descriptive, or both, it is
important to make sure that scales help parents, students, teachers, 
administrators, and policymakers understand the meaning of the per- 
formance in the same way. This common understanding helps ensure 
reliable and fair judgments. 



The Link with Standards 

Nearly all criteria, even descriptive checklists, are linked in some way
to standards, the expectations for student performance. Grades or quali-
tative ratings reflect teacher judgment, or in the case of the hands-on
science criteria in Figure 5.4, the consensus of the rating team. The
standards underlying different scales may reflect either criterion-refer-
enced or norm-referenced approaches to judging quality. The mathemat-
ics criteria (Figure 5.3) with descriptors for "inadequate response,"
"satisfactory response," and "demonstrated competence," reflect an
absolute standard or mastery approach to standard setting. The descrip-
tors clearly indicate good or desired performance levels, "satisfactory
and above," versus poor levels, "inadequate." The levels are referenced
to discipline-based standards, mathematics teachers' conceptions of
adequate problem-solving strategies.

Another example is Illinois' six-point writing assessment scale,
which employs an absolute scale and is designed to be used across grade
levels. A score of six represents an extremely high level of writing, and
few if any elementary students are expected to score above a "3." This
type of scale is especially useful in measuring growth over years. The
limitation of an absolute scale for multigrade/age assessment is that 
because elementary students all tend to score near the bottom of the 
scale, there is little variability in their scores so it is impossible to tell 
much about them individually from their scores. They all "look alike." 

Other evaluative scales reflect norm-referenced approaches to stand- 
ard setting. When grades or points are assigned by comparing students' 
relative status, such as, "Maria's essay was better than the class average,"
"Gary's video was among the best in the class," the standards are 
norm-referenced. Developmental checklists or scales demonstrate an- 
other common use of norm-referenced scales in alternative assessment. 
The sequencing of behaviors in these scales rests on what educators and 
others have observed over time to be typical performance at specified 
ages. For example, children who score "average" in reading readiness 
demonstrate behaviors typical for their age or grade level. "Below
average" or "developmentally delayed" refers to performance typical of
children in a younger age group than those being assessed. 

It is possible to anchor standards in both criterion- and norm-refer- 
enced information for the same assessment. You start with a criterion- 
referenced scale, a scale describing performance relative to a clearly 
defined set of behaviors, then gather or otherwise obtain data about how 
a national, state, or local sample of students performed on the same 
measure. You can then say "Maria wrote a well-organized essay, receiv- 
ing a '4' in organization; her performance was described as better than 
75 percent of the students in the state." Or, on a more informal level, in 
your classroom, you can always describe an individual student's per-
formance level in comparison to the rest of the class's performance:
"Maria's score put her among the best in the class."

Some scales may look like absolute or criterion-referenced scales but 
might actually incorporate both norm- and criterion-referenced informa- 
tion. An age- or grade-related scale defines student performance in terms 
of benchmarks or expectations for a particular grade level. Benchmarks 
for 5th grade mathematics problem solving will differ from those for the
7th grade. What constitutes excellence in essay organization at the 8th
grade will not do so at the 11th. Despite their "criterion-referenced"
appearance, scales tied to an age or grade level curriculum have an
underlying norm-referenced interpretation. The dimensions themselves
were derived from what students were able to do at particular grades,
not from absolute standards of performance across ages and grades. For
practical purposes, these grade level scales are considered criterion
referenced because their primary use is to decide what students can do
vis-a-vis particular content and skills rather than to compare them to
each other. 

How can you get the best of both worlds? By determining appropriate 
standards according to your assessment purposes. For classroom or 
schoolwide assessment use, you'll probably lean toward criterion-refer- 
enced or absolute standards. For selection decisions in which there are 
more candidates than available space, you will probably use absolute 
standards for inclusion in the candidate pool, but normative standards 
for the final selection. For example, if you are selecting horn players for 
the after-school honors band, you will choose only the top 2 percent of 
the candidates. 
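
The two-stage selection described above, an absolute standard for entering the pool followed by a normative standard for the final choice, can be sketched in Python. Everything named here is hypothetical: the function, its parameters, and the rule that at least one candidate is chosen whenever the pool is not empty.

```python
import math


def select_candidates(scores, minimum_score, top_fraction):
    """Two-stage selection sketch.

    Stage 1 (absolute standard): keep candidates scoring at or above
    `minimum_score`.
    Stage 2 (normative standard): keep the top `top_fraction` of that pool,
    ranked by score. Ties are broken by the order candidates were listed.
    """
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be greater than 0 and at most 1")
    pool = {name: s for name, s in scores.items() if s >= minimum_score}
    if not pool:
        return []
    # Assumption: select at least one candidate from a non-empty pool.
    n_selected = max(1, math.floor(len(pool) * top_fraction))
    ranked = sorted(pool, key=lambda name: pool[name], reverse=True)
    return ranked[:n_selected]
```

With `top_fraction=0.02`, the second stage corresponds to choosing the top 2 percent of the candidates who met the absolute standard.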

We have not discussed how standards are set. How do you know
where to set the acceptable level of performance? How good is compe- 
tent? What is the cut point between barely satisfactory and satisfactory? 
High-stakes assessments, such as graduation certification, use formal 
standard-setting procedures. These may include using a group of judges, 
provided with norm- and criterion-referenced information, to determine 
a passing score. In district or schoolwide assessment, passing scores or
labels describing poor and excellent performance are determined by 
consensus of those using the assessment. In the classroom, teachers set 
standards based on their experiences, their knowledge of what students 
have done in the past, their familiarity with expectations in a discipline, 
the current performance of students, and the purpose of the assessment. 

Considering Other Choices: 
Holistic or Analytic Criteria* 

Based on experience with direct writing assessment, we offer two more 
choices in specifying criteria: holistic and analytic. Holistic criteria
require raters to assign a single score based on the overall quality or on
one aspect of the student's response. An analytic scale requires that raters
give separate ratings to different aspects of the work. Criteria incorpo-
rating several outcomes are analytic.



*You may be familiar with the term "Primary Trait Scoring." When Primary Trait
criteria focus on only one trait, they are holistic; when expanded to two or more traits, they
become analytic.

Which Is Better?

By this time you know that we're going to say "it depends on the purpose 
of the assessment." The pattern of results from an analytic scale provides 
useful feedback about the strengths and weaknesses of the individual 
student and the classroom instructional program. Unfortunately, be- 
cause student performance on different dimensions of an analytic scale 
may be related in complex ways, the results may not be as clearly 
diagnostic as desired. Despite the fact that one of the qualities of a good 
analytic scale, from an efficiency and measurement perspective, is that 
each dimension be distinct, the subscale scores are often highly interre- 
lated and not well differentiated. CRESST research on analytic scoring 
scales found high correlations among scores for overall essay and para- 
graph organization, and between organization, support, and a general 
competence score. Under such circumstances, the diagnostic value of 
subscale performance is greatly diminished. 

Holistic scoring is usually simpler and faster than analytic scoring, an 
important concern when teacher time is involved. When the assessment's 
purpose is not to provide detailed data to guide program improvement, a quick 
overview of achievement may be particularly suitable for program 
evaluation, for flagging students who need more help, and for assigning final 
evaluations. 

Concurrent use of analytic and holistic strategies can optimize both 
diagnostic value and efficiency. One approach emerging from minimum 
competency testing is to score all essays holistically, then rate 
analytically those essays that were scored below minimum competency. 
Another strategy, used in the Maine statewide assessment, is to score essays 
holistically, but to note analytic dimensions that are particularly strong 
or weak in an individual's work as a kind of generic "comment" on the 
performance. 
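
The first of these combined strategies can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `two_stage_scoring`, the 1-to-6 holistic scale, the cut score of 3, and the `rate_analytically` callback are all hypothetical and are not taken from any particular testing program.

```python
def two_stage_scoring(holistic_scores, minimum_competency, rate_analytically):
    """Score every essay holistically; rate analytically only those below the cut.

    holistic_scores: dict mapping essay id -> holistic score already assigned
    minimum_competency: lowest holistic score counted as competent (assumed)
    rate_analytically: function(essay_id) -> dict of trait ratings (hypothetical)
    """
    report = {}
    for essay_id, score in holistic_scores.items():
        entry = {"holistic": score, "analytic": None}
        if score < minimum_competency:
            # Only essays below the cut get the slower, diagnostic rating.
            entry["analytic"] = rate_analytically(essay_id)
        report[essay_id] = entry
    return report


# Made-up example: three essays on an assumed 1-6 holistic scale, cut score 3.
holistic = {"essay_a": 5, "essay_b": 2, "essay_c": 3}
stub_rater = lambda essay_id: {"organization": 2, "support": 1}
report = two_stage_scoring(holistic, minimum_competency=3, rate_analytically=stub_rater)
for essay_id, entry in report.items():
    print(essay_id, entry)
```

The design choice is the one the text describes: rater time for analytic scoring is spent only where diagnostic detail is most needed.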

Opinions differ considerably regarding the value of these different 
approaches, and research is ongoing. The important point is not so much 
the correct labeling of scales, but that a variety of approaches exist and 
can prove useful. 



What About Portfolio Assessment? 

Portfolio assessment is often the first strategy that comes to mind when 
people think of alternative assessments. In some respects, portfolio 
assessment is a misnomer for "assessment of a body of work." In other 
instances, the portfolio assessment is really the assessment system. 

Portfolios are collections of student work that are reviewed against 
criteria in order to judge an individual student or a program. The 
portfolio or collection of work does not constitute the assessment; it is 
simply a receptacle for work (essays, videotapes, art, journal entries, and 
so on) that may or may not be evaluated. The "assessment" in portfolio 
exists only when (1) an assessment purpose is defined; (2) criteria or 
methods for determining what is put into the portfolio, by whom, and 
when, are explicated; and (3) criteria for assessing either the collection 
or individual pieces of work are identified. Deciding what should be 
included is really a task description, not a scoring guideline problem. 
What goes in, who chooses, when samples are taken— these are dimen- 
sions of the assessment task that define the setting and kinds of work 
that will be considered. (See Chapter 7 for more discussion of portfolio 
assessment.) 

There are two issues related to selecting the dimensions of scoring 
criteria for portfolio assessment: (1) What are the criteria for selecting 
the samples that go into the portfolio, and (2) What are the criteria for 
judging the quality of the samples? Before considering criteria for 
judging portfolios, you will first need to determine whether the portfolio 
should be rated as a whole or as individual samples. Second, you need 
to decide which dimensions reflect the intent or purpose of your 
assessment. When looking at a body of work, many issues arise, for example: 

■ Will progress or improvement be assessed? 

■ How, if at all, will progress be evaluated? 

■ How will different tasks, videos, art work, essays, journal entries, 
and the like be compared or weighted in the assessment? 

■ What is the role of student reflection in the assessment? Parental 
input? 

Once these issues are settled, defining the dimensions of portfolio 
scoring criteria is the same as defining multidimensional criteria. 
Perhaps the best known example of portfolio assessment criteria is provided 
by the Vermont Mathematics portfolio, which is summarized in Figure 
5.10. A body of mathematics work is evaluated on two major dimensions, 
problem-solving and communication skill. Within each dimension, 
several subdimensions further define each of the larger skills. Ratings are 
given for the subskills under the two dimensions, problem solving and 
communication. You can see how this example of portfolio assessment 
criteria resembles the multidimensional examples in Figures 5.1 and 5.7. 



Developing and Evaluating Scoring Criteria 
Beginning the Development Process 

The process for developing your own criteria is straightforward: 

■ Investigate how the assessed discipline defines quality perform- 
ance. 

■ Gather sample rubrics for assessing writing, speech, the arts, and 
so on as models to adapt for your purposes. 

■ Gather samples of students' and experts' work that demonstrate 
the range of performance from ineffective to very effective. 

■ Discuss with others the characteristics of these models that 
distinguish the effective ones from the ineffective ones. 

■ Write descriptors for the important characteristics. 

■ Gather another sample of students' work. 

■ Try out criteria to see if they help you make accurate judgments 
about students. 

■ Revise your criteria. 

■ Try it again until the rubric score captures the "quality" of the 
work. 

You probably noticed how recurrent this development process is. Initial 
ideas about important and scorable aspects of student performance 
become refined through use. Your criteria may focus on process — how 
a student approaches and solves a problem — as well as on the product 
or outcomes. 

For example, we can refer to the development process for the criteria 
in Figure 5.5 (Baker, Aschbacher, Niemi, and Sato 1992). CRESST 
developed its rubric for rating depth of content understanding in history 
by collecting and examining the differences in essays written by history 
experts (university professors and graduate students in history) versus 
those written by novices (high school students). CRESST researchers 
looked for dimensions that seemed to differentiate the performance of 
these two groups. In a number of subject areas, the researchers observed 
differences between the students and the experts in the application of 
prior knowledge, the use of organizing concepts and principles, and 
misconceptions. These traits defined the first draft of scoring criteria. 
The criteria were then tried out on samples of student work and further 
clarified and refined to ensure that the scales were clearly defined, were 
appropriate for the range of student responses likely to be encountered, 
and enabled teachers or other raters to distinguish between essays that 
deserved adjacent points on a scale. 

While undertaking the task of developing criteria, don't forget to take 
advantage of others' work. Quite often you can import or modify criteria 
from state and local assessment programs, curriculum experts, or col- 
leagues who have grappled with similar assessment problems. Research 
literature on alternative assessment also provides examples of pilot 
alternative assessments similar to the one appearing in Figure 5.4, which 
can be adapted for classroom use. There is also a small but growing 
literature on the nature of expertise in various disciplines, such as how 
an historian reads and uses primary source documents. 



Evaluating Criteria 

Your criteria for judging students' work shape the decisions you eventu- 
ally make about programs and students. Regardless of whether you are 
developing your own criteria or using those provided by others, it is 
important to review the quality of the scoring guidelines. We conclude 
this chapter with a proposed set of "criteria for criteria" — a checklist you 
can use to rate the quality of scoring criteria you borrow or develop. Our 
proposed criteria appear in Figure 5.11. 

Now let's look at a set of dimensions for assessing the worth of your 
own criteria. 

Keyed to Important Outcomes 

At a minimum, criteria for judging student performance need to address 
all the student outcomes you are trying to measure. For example, your 
criteria for judging student drama productions should encompass all the 
important drama and art outcomes that you want to be able to assess, and no 
others. If originality and logical presentation are part of the desired 
outcomes, you will want to include scales for judging those aspects of 
student work. If they are not important outcomes, omit them. 

Sensitive to Purpose 

What educational decisions will you make on the basis of your 
assessment? The answer to this question should guide your decisions about 
whether to use a checklist or rating scales, how many scales, which traits, 
what types of scale, and so forth. Do you need a global, holistic view of 
student achievement or an analytical one that gives you information 
about several specific aspects of students' achievement? Do you need the 
information in the form of a number for ease of reporting and aggregation 
at the expense of detail, or do you need the richness of qualitative 
description, or perhaps both? 



Figure 5.11 

How Do You Evaluate Scoring Criteria? 

□ All important outcomes are addressed by criteria. 

□ Rating strategy matches decision purpose: holistic for global evaluative 
view; analytic for diagnostic view. 

□ Rating scale provides usable, easily interpreted score. 

□ Criteria employ concrete references, clear language understandable to 
students, parents, other teachers. 

□ Criteria reflect current conceptions of "excellence" accepted in the field. 

□ Criteria have been reviewed for developmental, ethnic, gender bias. 

□ Criteria reflect teachable outcomes. 

□ Criteria are limited to feasible number of dimensions. 

□ Criteria are generalizable to other similar tasks or larger performance 
domain. 



Meaningful, Clear, and Credible 

The criteria by which you judge a performance need to be meaningful to 
students, parents, raters, teachers, administrators, policymakers, and the 
public. If the criteria are not credible, the results will probably be ignored 
or may be misused. Examples of student work that illustrate criterion 
traits can help make the criteria concrete for others. Involving others in 
the development of criteria increases their credibility. 

Because one of the tenets of performance assessment is public and 
discussed criteria, your criteria need to make sense to students so that 
they will be able to apply them easily to their own work and become 
self-regulated learners. Although judgments of student performance 
tend to be subjective by their nature, they are more reliable and credible 
when they rely less on high inference and more on observable, concrete 
characteristics. 



Fair and Unbiased 

Not only do assessment tasks need to be fair, but so do the criteria by 
which you define excellence. Unrecognized biases can seep into your 
definitions of traits, your specifications for what kind of performance 
earns which scale point, and your application of those criteria to 
individual pieces of student work. When you want your criteria to have 
diagnostic value, they must be sensitive to instruction and students' 
opportunities to learn the skills that are assessed. In contrast, you do not 
want them to reflect variables over which educators have no control, 
such as a child's culture, sex, or socioeconomic background. 

Feasible 

Several reasons exist to limit the number and complexity of the perform- 
ance dimensions to be judged. First, the time, effort, and money available 
for judging performance are always limited, sometimes severely so. 
Second, raters find it difficult to address too many different aspects of a 
work at once. In our experience at CRESST, raters were frustrated when 
asked to use more than six or seven scales for rating student essays. It 
became an onerous task and a less reliable process. Third, students will 
probably find it difficult to deal with too many aspects of their work at 
once. And finally, administrators and policymakers usually need infor- 
mation in as brief a form as possible. Separate scores for a large number 
of traits or for complex characteristics may make it more difficult to use 
the results effectively. 

Generalizable 

Although we recognize that criteria for performance are strongly linked 
to discipline-based notions of excellence, rating can be more efficient 
when a single set of "generic" criteria can serve multiple topics, tasks, 
or disciplines. For example, we could develop a common set of criteria 
for assessing student understanding of science concepts through 
journals, hands-on experimentation, computer simulation, and oral 
presentation. We could also use a common set of criteria for judging student 
essays in social studies, science, and math. As disparate as these 
situations may seem, it is possible to develop generic criteria for some 
purposes. If we could conceptualize excellence in consistent ways across 
assessment methods and disciplines, our criteria could have a more 
powerful impact on learning and instruction. Our example of the 
CRESST history-social studies rubric (Figure 5.5), which has also been 
applied to science and economics, shows one strategy for developing 
cross-discipline criteria. Like all good criteria, these proposed 
dimensions are subject to revision and refinement. 



Ensuring Reliable Scoring 



A fundamental feature of performance-based assessment is its reliance 
on human judgment. As any trial lawyer will attest, two people viewing 
the same occurrence or reading the same document often come up with 
conflicting perceptions or interpretations. Likewise, persons viewing the 
same behavior on different occasions may arrive at different judgments 
about that behavior. The user or developer of alternative assessments 
must seek to minimize such differences; otherwise the measures cannot 
be fair, consistent, or valid. Sound scoring procedures help the process. 



Understanding the Importance of 
Reliability and Consistency 

The most obvious reason for consistent scoring is equity. To be meaning- 
ful, judgments of student performance cannot be capricious. You need 
to have confidence that the grade or judgment was a result of the actual 
performance, not some superficial aspect of the product or scoring 
situation. Was Yuki's grade unduly influenced by her spelling? Did Mark 
get a better (or worse) grade because his project was graded near the end 
when you were tired? How was Jamal's grade affected by the fact that 
another teacher did part of the scoring? What about Corinne? Did she fail 
the competency writing test this year because the raters were more 
stringent than last year? 

Inconsistency is especially troublesome when the results influence 
important decisions about students or programs. What grade does Den- 
isha deserve? Should Marta be allowed to take the Advanced Placement 
English class despite low standardized test scores? Should the school's 
new math program continue? Even when the results of a single assess- 
ment do not carry high stakes, inconsistency means inaccurate scoring. 
More to the point: inconsistent scoring means the scores have little 
meaning. If an "A" doesn't consistently represent excellent performance, 
then what does it mean? The best in the class? The best of a poor lot? 
Improved effort? If a performance or project receives different scores 
from different judges, what does each really mean? Which one is accu- 
rate? If you apply criteria differently depending on how long you've been 
scoring, what does the final set of scores mean? What does an individ- 
ual's score mean? 



Achieving Consistency 

Equitable and meaningful scoring requires informed and consistent 
judgment. How do you avoid capricious subjectivity? As we discussed 
in Chapter 5, having well-defined and defensible criteria for judging 
student performance goes a long way toward achieving consistent scor- 
ing, but there are other conditions that must be met to ensure consistent 
scoring. First, those making judgments — you, teacher colleagues, the 
state department of education — must thoroughly understand the criteria 
in a similar fashion. A consensus among raters about the meaning of the 
criteria and how they are to be applied builds the foundation for scoring 
consistency. Second, you need a system for monitoring the consistency 
of ratings over the period in which performance is being judged. This 
consistency has several facets. Two or more judges rating the same 
performance should have general agreement. One judge should rate a 
particular performance in much the same way regardless of when it is 
observed — whether during the beginning of the day, somewhere in the 
middle, or near the end. Judges should rate the same performances 
similarly on separate occasions. And, the same performances rated on 
two separate occasions by two different groups of judges should be rated 
similarly. If your scores are used to make high-stakes decisions such as 
promotion, graduation, or special class placement, you should formally 
document evidence of scoring consistency. 



Professional Development Benefits 

The process by which judges learn to apply scoring criteria in a consis- 
tent manner can provide a valuable opportunity for professional devel- 
opment. Rater training helps teachers come to a consensual definition 
of key aspects of student performance. This can lead to a reprioritization 
of classroom goals as well as insight about the strengths and weaknesses 
of their students' performances. The scoring process can provide a model 
for classroom assessment and encourage more collaboration among 
teachers in the appraisal of student outcomes. 

To reap the benefits of consistency and professional growth, you will 
need good training procedures and a carefully structured rating process. 
This chapter outlines major considerations in devising and implement- 
ing a valid scoring procedure. Although the process we describe has its 
origin in formal, high-stakes assessment at the district and state level, 
keep in mind that consistent scoring applies to all forms of assessment, 
be they classroom grades or college admissions. Decisions about a 
student can't be valid unless based on reliable information. 



There are a number of ways to achieve consistency. Our approach 
emphasizes training raters to a common standard because this approach 
is efficient and provides teachers with instructionally useful informa- 
tion. Other approaches devote less attention to rater training and con- 
sensus-building and rely on multiple judgments of student work to 
achieve a similar result. As you might expect, the approach you choose 
depends on your assessment purpose and available resources. 

During rater training, judges learn what the scoring criteria mean, 
what aspects of performance each is intended to capture, and what each 
of the scale points represents. It is during the training session that you 
make sure raters apply the criteria consistently to a range of student work 
samples. This is also the time when raters learn how to record their scores. 



Training Manuals 

Formal scoring manuals can be very helpful both during and after 
training. For large-scale assessments, such as yearly district or state 
testing programs, a scoring manual provides an "institutional memory" 
of assessment procedures and serves as a useful reference for interpreting 
scores. For high-stakes classroom assessments, such as Advanced Place- 
ment "screening" examinations, or an algebra readiness test, scoring 
manuals can be useful in discussions with parents or students who want 
to know how scores are achieved or improved. Typical scoring guides 
include: 

■ Fully explicated scoring criteria; 

■ Examples or models illustrating each score point; 

■ An abbreviated, one-page version of the criteria for reference 
during actual rating; and 

■ A sample form for recording scores. 

You might want to review training manuals from several sources before 
designing your own rater training. If you are interested in a detailed 
description of the rater training process, a complete scoring manual 
developed by the Riverside Publishing Company appears in Educational 
Performance Assessment, edited by Fred Finch (1991). State depart- 
ments of education are also sources of published scoring manuals. 



Training Procedures 

Actual rater training is designed to create a consensual understanding 
of the scoring criteria, provide extensive practice in actual scoring, and, 
in the case of high-stakes assessment, document acceptable levels of 
scoring consistency (reliability). During rater training, practice scoring 
sessions provide raters immediate, substantive feedback about their 
judgments and ample opportunities to ask questions. Raters also come 
to understand that their job is to make a judgment based on the scoring 
rubric, not to revise or criticize the rubric and then follow their own 
inclinations. Without such an understanding, an entire assessment 
enterprise can be sabotaged. 

A typical training session includes: 

■ Orientation to the assessment task. Raters receive an overview of 
the assessment context, what the results will be used for, who will 
use them, what directions and prompts the students received, and 
how the scoring guide operationalizes desired outcomes or proc- 
esses. It is common to ask raters to actually take the test as a means 
of orienting them to the scoring task. 

■ Clarification of the scoring criteria. In this phase of training, 
raters engage in extensive discussion. Both the criteria dimensions 
and scale values are defined and a range of models provided 
to exemplify each. Discussion often moves from simpler 
judgments, such as which samples illustrate high, medium, or low 
performances, to more difficult distinctions required for 
assigning numerical scores. 

■ Practice scoring. This is the heart of the rater training process. At 
first, sample assessments are scored one at a time with discussion 
following each paper. As raters become more fluent with the 
scoring guide, they get opportunities to exercise more difficult 
judgments with problematic (atypical) or borderline assessments. 

■ Protocol revision. During the discussion and practice scoring, 
raters naturally devise certain rules for dealing with the unantici- 
pated aspects of judgment posed by a particular set of papers and 
not covered by the scoring guide. For example, when almost every 
student has misinterpreted the test prompt in the same fashion, 
rather than to score all answers as "off topic" or "unacceptable," 
raters may decide to assign scores based on the student-defined 
task. Or, if many traits are to be scored, raters may decide that 
different raters should specialize in scoring a few of the traits 
rather than having all raters score every sample on every 
dimension. 

■ Score recording. For all assessments, student scores must be 
recorded in some fashion, on the roll sheet or on summary sheets 
for a classroom, grade level, or school. Rater training covers the 
format for recording scores and any special procedures for 
calculating student scores such as averaging and totalling across 
dimensions. 

■ Documenting rater reliability. Rater training ends when there is 
agreement that scorers have reached an acceptable level of 
consistency, usually rating sample pieces within one point of each 
other. In order to determine when raters are ready for the real 
thing, reliability checks are conducted during training. Figure 6.1 
provides an example of how to check rater consistency using the 
percent agreement method. 

■ Scheduling Considerations. How much time will it take to train 
raters to an acceptable level of agreement before letting them 
judge student work? It depends on: 

— How experienced your raters are. 

— Whether they are familiar with your scoring criteria. 

— How quickly raters come to consensus about the meaning of 
the criteria. 

— The complexity of the scoring criteria, and the quality of the 
work to be judged, with borderline work being the most 
difficult to assess quickly. 

We have found that it takes about three to four hours to train raters to 
use a holistic or simple (two- to four-trait) analytic scale. More complex 
scales can require up to a full day of training. 

Rater fatigue is an important factor in scoring; we consider a six-hour 
session a full day's work. You should also schedule time for retraining 
or refreshing raters at the beginning of each new scoring day, and 
certainly for any changes in topics or tasks that use the same scoring 
criteria. In high-stakes assessment, retraining often takes place after any 
lengthy breaks such as lunch. 



Figure 6.1 

Calculating Rater Agreement 
(Three raters for two papers) 

Figure 6.1 illustrates the case in which three raters are asked to rate two criterion papers 
after some training. According to the results in the figure, Linda agrees with the criterion 
score for paper 1 but not for paper 2; in fact, for paper 2 she is not even within one point 
of the criterion score. Robert is not in perfect agreement with the criterion scores on either 
paper 1 or paper 2 but is in agreement plus-or-minus one score point on both papers. Ella 
is in agreement all the time and is ready to rate student work. Robert and Linda probably 
need a little more training. Paper 2 causes more problems for raters than paper 1, so 
further training should focus on distinguishing the criterion score from neighboring scale 
points. In reporting these results you could say, "On average, raters obtained perfect 
agreement with criterion scores 50 percent of the time, and reached ±1 agreement 
83 percent of the time." 
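
The percent agreement method behind Figure 6.1 can be sketched in Python. The scores below are invented (the figure's actual numbers are not reproduced here); they are chosen only to match the pattern described for Linda, Robert, and Ella, and the function name `agreement_rates` and the 1-to-6 scale are likewise assumptions.

```python
def agreement_rates(criterion, ratings):
    """Return (exact, within_one) agreement as fractions of all ratings.

    criterion: dict paper -> criterion ("expert") score
    ratings: dict rater -> dict paper -> that rater's score
    """
    exact = within_one = total = 0
    for rater_scores in ratings.values():
        for paper, score in rater_scores.items():
            difference = abs(score - criterion[paper])
            exact += difference == 0        # perfect agreement
            within_one += difference <= 1   # "plus or minus one" agreement
            total += 1
    return exact / total, within_one / total


# Hypothetical scores on an assumed 1-6 scale.
criterion = {"paper 1": 4, "paper 2": 3}
ratings = {
    "Linda": {"paper 1": 4, "paper 2": 5},   # exact on paper 1, two points off on paper 2
    "Robert": {"paper 1": 3, "paper 2": 4},  # one point off on both papers
    "Ella": {"paper 1": 4, "paper 2": 3},    # exact on both papers
}
exact_rate, adjacent_rate = agreement_rates(criterion, ratings)
print(f"exact: {exact_rate:.0%}, within one point: {adjacent_rate:.0%}")
```

With these made-up scores, 3 of the 6 ratings match the criterion score exactly and 5 of the 6 fall within one point of it.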



Training Paper Issues 



Because rater training provides a dry run for actual scoring, it behooves 
you to anticipate as many possible sources of rater disagreement as 
possible before rater training and to build opportunities into the training 
papers for eliciting disagreement and discussing it. For example, the 
syntactical constructions used by non-native English speakers raise 
issues related to balancing content with communication concerns. You 
should also deal with handwriting and legibility issues or aesthetic 
quality concerns in visual and performing arts. Finally, you want to be 
sure that the sample papers you select for training represent not only 
each point on the score distribution but also the entire range of student 
performance likely to be encountered in scoring. The natural human 
tendency is to grade normatively. The better work samples from a set of 
relatively poor papers may receive higher scores than they would were 
they part of a stack of relatively good papers. The reverse can also be the 
case. This tendency should be discussed during rater training with 
examples provided so that the scoring criteria maintain the same mean- 
ing across different sets of papers and different scoring occasions. 



Obtaining Sample and Check Papers 

Because a wide array of sample work is needed to guide raters, you 
should collect samples from a diverse group of students. Pick work from 
a field-test, a previous assessment, or from the actual assessment. To 
identify appropriate training and check papers, a group of "experts"— 
teachers from the grades and subjects involved who are familiar with 
your scoring criteria— can be quite helpful. They can select examples 
that illustrate the range of responses, from clear to borderline, for each 
score point so that raters will be trained to handle all situations. If several 
prompts or tasks are used in the assessment, examples need to be drawn 
for each. If you are using age-related scales across grade levels, you need 
examples to illustrate each age level. It is also useful to prepare comment 
sheets explaining how the specific aspects of each piece of work repre- 
sent criteria for a particular score. The expert group can then identify 
samples that will be used for (1) training discussions, (2) practice, and 
(3) checking consistency. 



Score Recording Concerns 

You need to provide raters a method for recording student scores. In your 
own classroom, you might simply record scores at the top of the student's 
paper and then in your roll book. Some teachers use the scoring criteria 
as a feedback sheet for students. They circle deficient areas or note 
strengths using the descriptors on the guide. The same process can be 
used to create a classroom profile on one master scoring guide. 

In more formal assessment settings, score sheets become a matter of 
public record and are used to provide feedback to teachers and others. 
Data analysts also use them to calculate test statistics. In these instances, 
raters are often given machine readable documents for "bubbling" in 
student scores as well as other important information such as the school, 
district, student, and rater identification numbers and the code numbers 
for topic or task and date. Whenever you have two or more raters scoring 
student work, you'll need to remind them not to indicate scores, 
comments, or corrections on the sample itself. You don't want a subsequent 
rating influenced by their comments. 



Reliability Issues 

The purpose of rater training is to create consistent, reliable scoring 
procedures. Thus, a method of determining if raters are consistent 
should be built into the training period. Many strategies for checking 
rater reliability exist. One commonly employed approach is to prepare 
in advance and score a set of ten or so "reliability check" papers 
representing the range of student performance. Ask the raters to score 
this same set and compare their judgments with yours or those of others who are 
trusted assessors. Reasonable agreement with both the expert judgments 
and with each other suggests that raters are ready to score actual student 
work. 

What constitutes reasonable agreement? You can ask that all raters be 
in exact agreement before you consider them reliable, or you can use the 
less stringent "plus or minus one" rule, which is fairly common and says 
that raters are "in agreement" if they agree within one scale point, "plus 
or minus." For example, if the score on a particular reliability-check 
sample is a "3," anyone who gave it a rating of "2," "3," or "4" is
considered to be on target. 
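
The "plus or minus one" rule can be written as a one-line check. The following is a minimal sketch in Python; the function name and the `tolerance` parameter are illustrative and are not part of any scoring system described in this book.

```python
def in_agreement(rating, target, tolerance=1):
    """Return True if a rating is within `tolerance` scale points of the target score.

    tolerance=0 corresponds to exact agreement; tolerance=1 is the
    "plus or minus one" rule.
    """
    return abs(rating - target) <= tolerance


# A reliability-check sample with a target score of 3, rated by four raters.
ratings = [2, 3, 4, 5]
print([in_agreement(r, 3) for r in ratings])               # plus or minus one
print([in_agreement(r, 3, tolerance=0) for r in ratings])  # exact agreement
```

Under the less stringent rule only the rating of 5 is off target; under exact agreement only the rating of 3 counts.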

Regardless of the target level of agreement you choose, when you train 
raters, the goal is to have them apply the scoring criteria exactly as
intended, not to within one scale point of the target score. When a rater
has difficulty applying the criteria exactly as intended, you should spend
time during training discussing the practice papers, criteria, and decision
rules for applying the criteria in order to bring the rater up to an
acceptable level of consistency. However, some raters may not be able to
adjust their internal criteria to match the scoring guides. These aberrant
scorers should be dismissed or assigned to other tasks during actual
scoring.

In addition to deciding how close ratings should be to establish
consistency, you need to think about how often they need to be in such
agreement. If you are asking for exact agreement, which can be difficult
to obtain, your criterion for reliability may be less stringent than if you
are using the "plus or minus one" rule. At CRESST, we often ask that
raters agree with the experts at least 90 percent of the time on each
scoring dimension when using the "one point off" guideline. The guideline
for exact agreement could drop to 75 to 80 percent under the more
stringent condition. The actual percentage of agreement varies depending
on the assessment purpose and stakes involved.
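
To make the idea of "how often" concrete, the sketch below computes the fraction of check papers on which a rater agrees with the expert scores, for one scoring dimension. It is a hedged illustration only: the function names, the sample scores, and the thresholds passed in are assumptions, and the required rate is something you set according to your own purpose and stakes.

```python
def agreement_rate(rater_scores, expert_scores, tolerance=1):
    """Fraction of check papers on which a rater is within `tolerance` of the expert score."""
    if len(rater_scores) != len(expert_scores) or not expert_scores:
        raise ValueError("score lists must be non-empty and the same length")
    hits = sum(abs(r - e) <= tolerance for r, e in zip(rater_scores, expert_scores))
    return hits / len(expert_scores)


def qualifies(rater_scores, expert_scores, tolerance, required_rate):
    """True if the rater meets the required agreement rate on this scoring dimension."""
    return agreement_rate(rater_scores, expert_scores, tolerance) >= required_rate


# Hypothetical data: ten check papers on one scoring dimension (1-6 scale).
expert = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6]
rater = [1, 2, 3, 3, 4, 4, 4, 5, 3, 6]

print(agreement_rate(rater, expert, tolerance=1))  # "one point off" agreement
print(agreement_rate(rater, expert, tolerance=0))  # exact agreement
# Illustrative thresholds; choose your own for each agreement rule.
print(qualifies(rater, expert, tolerance=1, required_rate=0.90))
print(qualifies(rater, expert, tolerance=0, required_rate=0.75))
```

Run the check once per scoring dimension, since a rater can be consistent on one dimension and not on another.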

Regardless of how you define "rater agreement," the purpose of 
reliability checks is to ensure that student scores aren't the result of
capricious judgment, one of the most commonly cited arguments against
performance assessment. Consider the classic study conducted by Paul
Deidrich (10H3) at the Educational Testing Service in which the same
essay was assigned an entire range of scores by a group of raters. What
most don't remember about this study is that acceptable levels of rater 
agreement were obtained when the judges (1) were drawn from the same 
discipline, (2) used explicit scoring criteria, and (3) participated in a 
training session. 

Ensuring Equitable Judgments During
an Actual Scoring Session 

Maintaining Consistency 

Documenting rater consistency during training is simply the first step 
toward creating a fair, equitable scoring process. Because the purpose of 
rater training is to develop rater consistency, you need to monitor rater 
scoring patterns during the actual scoring process as well. Research 
shows that raters have a tendency to drift away from formal criteria to 
their own, more idiosyncratic views (Quellmalz and Burry 1983). Human
judgments and expectations are shaped not only by formal standards,
such as scoring criteria, but also by their prior experience and the
actual range of performance currently being assessed. If the entire set of
performances appears to be relatively "poor" according to the objective
criteria, raters develop a tendency to shift the criteria downward so they
can award higher scores to the "best of the worst" papers. As a teacher,
you too have perhaps been aware that your standards and expectations
for students change during the grading process. You modify your ideas
somewhat after looking at several pieces of student work. For this reason,
training sessions need to include a large sample of papers and the entire
range that might be encountered during actual scoring.

For classroom assessment purposes, you can check your consistency
by stopping midway and rescoring some of the first student work you
scored. When you are scoring several different dimensions or topics, you
can score all work on one dimension or related to one topic at the same
time, then go back and score for other factors. Scoring all papers several
times, once for each different dimension or topic, is often quicker than
going through individual papers for everything at once and applying
multiple criteria or reading different kinds of responses. Your scoring
pace also increases as you become familiar with the criteria.

For school-level, larger-scale, or high-stakes assessment, you'll want
to build in more formal rater consistency checks. For essay scoring this
is sometimes done by burying pre-scored common check papers at
designated intervals in each rater's stack of papers. The scoring director
then checks raters on the common paper and works with those who have
drifted away from a consistent application of the scoring guide. Another
method is to conduct mini-training sessions first thing in the morning
or right after lunch. Raters score a common set of check papers, much as
they did in training. Those who have drifted from the preset standard
(exact agreement; plus or minus one point) participate in a review
session and are rechecked before being allowed to continue scoring.
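
Burying check papers at designated intervals is easy to sketch in code. The example below is an assumption-laden illustration, not a description of any particular scoring program: the function names, the paper IDs, and the interval are all hypothetical.

```python
def seed_check_papers(stack, check_papers, interval):
    """Return a new stack with one check paper inserted after every `interval` regular papers.

    Stops inserting when the check papers run out.
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")
    remaining = list(check_papers)
    seeded = []
    for count, paper in enumerate(stack, start=1):
        seeded.append(paper)
        if count % interval == 0 and remaining:
            seeded.append(remaining.pop(0))
    return seeded


def drifted(rater_scores, preset_scores, tolerance=1):
    """Return IDs of check papers on which the rater missed the preset score by more than `tolerance`."""
    return [pid for pid, preset in preset_scores.items()
            if pid in rater_scores and abs(rater_scores[pid] - preset) > tolerance]


stack = seed_check_papers(["p1", "p2", "p3", "p4", "p5", "p6"], ["c1", "c2"], interval=3)
print(stack)

# Preset (pre-scored) values for the check papers and one rater's scores on them.
print(drifted({"c1": 4, "c2": 2}, {"c1": 4, "c2": 5}))
```

A rater who appears in the `drifted` list would be a candidate for the review session before continuing to score.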

An additional consistency consideration in large-scale assessment
relates to lack of bias in rater judgments. You need to be sure that raters
working together don't form subgroups who agree with each other but
not all the other participating raters. To avoid this, break up rater groups
at periodic intervals and have second ratings of papers/work done by
raters assigned to other tables or physical locations.

Managing Logistics

Although achieving consistent judgment is the overriding concern of 
scoring, conducting a scoring session involves a number of logistical and 
technical issues. Scheduling is one of the most fundamental concerns in 
planning a scoring session. As people tend to tire in the afternoon and
rate more slowly, you might consider scheduling your rating sessions
early and avoiding the late afternoon. Access to a copy machine enables 
you to address any unanticipated shortages of rating materials or to 
reproduce papers that require discussion during the rating session. 
Further, rating is an intense activity; provide frequent breaks and snacks 
(lots of fruit and carbohydrates, little sugar). The scoring area itself 
should be quiet and comfortable with ample room for raters to accommodate
the work to be reviewed. A rater's nightmare is to work in the
gym on folding chairs and tables at 3:30 on a hot May afternoon during 
band practice. 

Another concern is managing the flow of papers or other student 
products. In large-scale assessments, each table of scorers should have 
their own leader whose sole duty is to manage the paper flow and 
monitor rater consistency. Our experience suggests that bundles of 
student work that take about one hour to rate are easier for raters to 
handle than individual pieces. The number of pieces in each bundle will 
vary with the nature of the task and the complexity of the scoring scheme. 
In writing assessments, for example, sets often consist of fifteen to 
twenty-five papers, whereas a bundle of portfolios might include only 
four to six. Regardless of how work is bundled, individual pieces must
be randomly assigned to bundles and bundles randomly assigned to 
raters so that no systematic scoring effects occur. For formal assessments, 
both raters and students should be assigned identification numbers to 
guard against bias and protect privacy. 
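
The random assignment described above can be sketched as follows. This is a minimal illustration under stated assumptions: pieces and raters are represented only by hypothetical identification numbers, and the function name, bundle size, and `seed` parameter are not drawn from any actual scoring system.

```python
import random


def make_bundles(piece_ids, bundle_size, rater_ids, seed=None):
    """Randomly assign pieces to bundles, then bundles to raters.

    Returns a list of (rater_id, bundle) pairs. Bundles are dealt to the
    shuffled raters in rotation, so workloads differ by at most one bundle.
    """
    if bundle_size < 1 or not rater_ids:
        raise ValueError("need a positive bundle size and at least one rater")
    rng = random.Random(seed)
    pieces = list(piece_ids)
    rng.shuffle(pieces)  # random assignment of pieces to bundles
    bundles = [pieces[i:i + bundle_size] for i in range(0, len(pieces), bundle_size)]
    raters = list(rater_ids)
    rng.shuffle(raters)  # random assignment of bundles to raters
    return [(raters[i % len(raters)], bundle) for i, bundle in enumerate(bundles)]


# Sixty papers identified by number only, bundled in sets of twenty.
assignments = make_bundles(range(1001, 1061), bundle_size=20,
                           rater_ids=["R01", "R02", "R03"], seed=7)
for rater, bundle in assignments:
    print(rater, len(bundle))
```

Fixing the `seed` makes the assignment reproducible, which can help if you later need to document how work was distributed.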

You'll need to decide whether to mix different grade levels or differ- 
ent topics together in the same scoring session. Generally, this is not done 
unless the purpose of an assessment is to compare students at different 
grade levels on the same scoring scale. In large-scale assessments, 
different topics are either assigned to different rater groups or scored 
separately from each other with a session of refresher training preceding
the topic change. 

Another concern that can cause problems later if not monitored
carefully is ensuring that scorers are recording required information
properly. Were all identification numbers bubbled in along with the
scores? Were scores recorded for all papers rated? Do all students have
scores? The list is extensive. Try to anticipate what can go wrong and
devise strategies for either preventing it from happening or for fixing it.



Ensuring Technical Quality 

Advice on all the technical decisions you have to make to ensure scoring
accuracy and equity is beyond the scope of this book and in fact 
constitutes a psychometrician's career. If you are assessing for a high- 
stakes decision, especially if that decision can get you sued, disparaged 
on page one of your local newspaper, or called before the board of 
education, you may want to bring in a technical consultant to structure 
your scoring process and help you document the reliability of student 
scores. Following are some of the questions you need to address: 

How many raters are needed? This, of course, depends on how much 
work is rated, how many ratings each piece will receive, how long it 
takes to rate each piece, and how many days are available for scoring. 
Holistic scoring of one-to-two page essays generally goes quickly, some- 
times as quickly as a minute a paper. A complex analytic rating on longer 
pieces can take four to five minutes per paper. Portfolios can take longer 
still. As for the number of days, our experience suggests raters can get 
quite burned out after four or five days. 
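
The arithmetic behind this question can be sketched in a few lines. The function below is a rough planning aid under stated assumptions: its name and parameters are illustrative, the number of productive scoring hours per day is something you must estimate yourself, and the result ignores training time, breaks, and consistency checks.

```python
import math


def raters_needed(pieces, ratings_per_piece, minutes_per_rating, days, scoring_hours_per_day):
    """Estimate how many raters are needed to finish scoring in the available days."""
    total_minutes = pieces * ratings_per_piece * minutes_per_rating
    minutes_per_rater = days * scoring_hours_per_day * 60
    return math.ceil(total_minutes / minutes_per_rater)


# Hypothetical job: 6,000 essays, each rated twice, two days, five scoring hours a day.
print(raters_needed(6000, 2, 1, days=2, scoring_hours_per_day=5))  # holistic, about 1 minute each
print(raters_needed(6000, 2, 5, days=2, scoring_hours_per_day=5))  # analytic, about 5 minutes each
```

The comparison shows how quickly a more complex rating scheme multiplies the number of raters, or days, you need.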

How many scores per paper? Effective training and vigilant monitor- 
ing of the scoring process can eliminate much of the need to do multiple 
scoring of the same dimension of student work. Multiple raters are 
needed for each paper when raters are inexperienced or there is little 
evidence that raters are using the same criteria and standards in making 
their judgments. The need for multiple scores depends on your assess- 
ment purpose. The more serious the consequences, the more important 
it is that you document consistency. Our experience suggests that no 
more than two raters are needed for any piece; the ratings can be summed 
or averaged to provide a final score. A third opinion can be called in for
difficult cases, such as the occasional nightmare paper that draws both
the lowest and highest score.

In some situations, one score is sufficient for a majority of the pieces. 
Consider a situation in which selection, placement, or other critical
decisions about individual students will be made based on some prespe- 
cified standard or cut score. If your training and scoring check papers 
show that raters are consistent, the only papers requiring two or more 
ratings will be those borderline papers falling around the passing score. 
Because rating is an expensive process, you will need to balance reliabil- 
ity concerns against those for cost and efficiency. 
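
One way to picture these decision rules is the sketch below. It assumes, purely for illustration, that "borderline" means within a fixed band around the cut score and that a third opinion is requested whenever two ratings differ by more than a set gap; neither value is prescribed here, and the function names are hypothetical.

```python
def needs_second_rating(first_score, cut_score, band=1):
    """True if a single rating falls close enough to the cut score to warrant a second rating."""
    return abs(first_score - cut_score) <= band


def resolve(first, second, max_gap=1):
    """Average two ratings, or return None to signal that a third opinion is needed."""
    if abs(first - second) > max_gap:
        return None
    return (first + second) / 2


# Hypothetical first ratings on a 1-6 scale with a cut score of 4.
first_ratings = {"S101": 2, "S102": 4, "S103": 5, "S104": 6}
second_pass = [sid for sid, score in first_ratings.items() if needs_second_rating(score, 4)]
print(second_pass)    # papers to send to a second rater
print(resolve(3, 4))  # two close ratings are averaged
print(resolve(1, 6))  # widely split ratings call for a third opinion
```

Narrowing the band lowers cost but leaves more near-cut decisions resting on a single rating.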

How are papers scored for evaluation purposes? If student scores
will be used for program evaluation rather than individual assessment, 
a reliable estimate of an individual student score is less critical than the 
average score for the task. Most pieces of work need be read only once,
and your reliability evidence can be obtained on a sample of work 
(perhaps 20 percent), which is rated by two or more raters. If you are 
using student samples to evaluate a program and don't have to provide 
individual scores to teachers, it is more efficient to score a randomly 
selected sample of student work. Your technical consultant can advise 
you about sample size and the appropriate manner of selection. 
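
Drawing a random subset for double scoring might look like the sketch below. The 20 percent default echoes the figure mentioned above, but the function name, ID numbers, and `seed` are assumptions, and your technical consultant's advice on sample size should take precedence over this illustration.

```python
import math
import random


def double_scoring_sample(piece_ids, percent=20, seed=None):
    """Randomly select roughly `percent` percent of the pieces for a second rating."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    pieces = list(piece_ids)
    k = math.ceil(len(pieces) * percent / 100)
    return sorted(random.Random(seed).sample(pieces, k))


# 500 pieces of student work identified by number.
sample = double_scoring_sample(range(1, 501), percent=20, seed=3)
print(len(sample))
```

Because the selection is random, the double-scored pieces should reflect the same range of performance as the full set.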



Providing Evidence of Reliability 

For high-stakes assessments, you need to formally document the consis- 
tency and reliability of your scoring process. Plan to invest in the services 
of a technical expert in advance of the scoring to ensure that you have 
an adequate scoring design, that you are collecting suitable evidence, 
and that your data are appropriately formatted to ease data analysis. 
The following are some relevant sources of evidence: 

■ Results of the qualifying check after training. Plan to report on 
what agreement level was required. What proportion of your 
raters passed on the first try? What was the average level of 
agreement among those passing? 

■ Results of the consistency check during scoring. Plan to report 
on what agreement level was required. How many checks were made, and when?
What proportion of your raters passed without
remediation? What was the average level of agreement on the 
checks? 

■ Inter-rater reliability results for student work scored by more 
than one rater. Percentage agreement among raters and generaliz- 
ability coefficients are two frequently used techniques. Each of 
these is calculated separately for each scale you use. As a guide, 
you need to double score at least 20 percent of your student 
samples to get sufficient evidence, and if more than two raters are 
involved, you need to consult a statistician for help with a 
balanced design specifying which raters are to score which pieces 
of student work. 

What level of agreement or reliability is high enough? Of
course the answer is: it depends on the decisions you are making.
The more critical or restrictive the consequences are, the more
reliable your scores need to be. In general, reliability coefficients
of .70 and above are considered respectable. Coefficients of .90
and above are not uncommon with standardized multiple-choice
tests and large-scale direct writing assessments.

■ Rater consistency across years. When you want to be sure that 
your rating scale is consistent from year to year — for example, 
when results are being used in state assessments to track trends 
over time — you need to include with this year's scoring a suffi- 
cient sample of student work from last year's scoring. Agreement 
in scores assigned can then be checked, and if necessary, statisti- 
cal adjustments can be made for differences. 

■ Rater consistency across different locations or different groups
of raters. Similar to checking consistency across years, if student
work is to be scored at a number of different locations or by
different groups of raters, you need to check on the consistency 
of these different groups. For example, a state might convene four 
regional workshops to score its hands-on science assessments, or 
a district assessment might require each school to score its own 
students' work. One way to check for consistency would be to 
seed the work scored by each group with a common set of work. 
At scoring site one, for instance, scorers would assess student 
work assigned specifically to site one plus the common set; site 
two scorers would assess student work assigned specifically to site
two plus the common set and so forth. Scores on the common set 
can then be checked for consistency. 

■ Intra-rater consistency. This is the degree to which one rater
remains consistent over time. Check for this by having raters score 
the same piece more than once at different points in the scoring 
process. 
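
Of the techniques listed above, percentage agreement is simple enough to sketch directly; generalizability coefficients are not attempted here and are best left to a statistician. The example is a minimal illustration with made-up scale names and scores, and the function names are assumptions.

```python
def percent_agreement(scores_a, scores_b, tolerance=0):
    """Percentage of double-scored pieces on which two sets of ratings agree within `tolerance`."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and the same length")
    agree = sum(abs(a - b) <= tolerance for a, b in zip(scores_a, scores_b))
    return 100 * agree / len(scores_a)


def agreement_by_scale(double_scored, tolerance=0):
    """Compute agreement separately for each scale.

    `double_scored` maps a scale name to a (first_ratings, second_ratings) pair.
    """
    return {scale: percent_agreement(a, b, tolerance)
            for scale, (a, b) in double_scored.items()}


# Hypothetical double-scored sample on two scales.
double_scored = {
    "organization": ([3, 4, 2, 5], [3, 4, 3, 5]),
    "mechanics": ([2, 2, 4, 4], [2, 3, 4, 2]),
}
print(agreement_by_scale(double_scored))               # exact agreement
print(agreement_by_scale(double_scored, tolerance=1))  # plus or minus one
```

The same comparison can be applied to a single rater's earlier and later scores on the same pieces, or to the scores two sites assign to a common set of work.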



Checking the Reliability of 
Your Rating Process 

As a summary of many of the issues covered in this chapter, use the 
following checklist to see if your scoring procedures are sound and 
reliable. Do you have: 

[ ] documented, field-tested scoring guide 

[ ] clear, concrete criteria 

[ ] annotated examples of all score points

[ ] ample practice and feedback for raters

[ ] multiple raters with demonstrated agreement prior to scoring 

[ ] periodic reliability checks throughout 

[ ] retraining when necessary 

[ ] arrangements for collection of suitable reliability data



7

Using Alternative 
Assessment for 
Decision Making 



We have considered a number of important issues in the development 
of good alternative assessments: What is alternative assessment? How do
we identify suitable assessment tasks? What should criteria include? 
What do sound scoring procedures look like? We now turn to the reason 
we've developed alternative assessments in the first place: to make 
appropriate decisions about students and programs. 

This is a critically important point: assessment is not an end in itself. 
Rather, assessment provides information for decision making about what 
students have learned, what grades are deserved, whether students 
should pass on to the next grade, what groups they should be assigned 
to, what help they need, what areas of classroom instruction need 
revamping, where the school curriculum needs bolstering, and so forth. 
Good assessment enables us to accurately characterize students' func- 
tioning and performance and to make sound decisions that will improve 
education. 

Does using the results of an assessment contribute to good decisions? 
This is the crux of how we judge the quality of an assessment. Policymakers
and the public have placed considerable faith in standardized
tests, in their quality, and their efficient ability to lead us to accurate 
conclusions about students and schools. Unfortunately, some believe 
that this faith in testing has been misplaced. As we have become more 
sophisticated consumers of assessment, we have raised more questions
about what these tests actually tell us. Do Scholastic Aptitude Test scores
really predict which students will be successful in college? If not, how 
much weight should they be given in college admission decisions? Do 
state assessments give schools the kind of information they need to 
improve their programs? Do they help policymakers and the public know 
whether students are learning what they need to know and be able to do?
Do multiple-choice tests allow students to demonstrate their full under- 
standing of a subject? If not, how much should we be relying on them 
when making decisions about students and programs?

Dissatisfaction with traditional tests has encouraged teachers and 
entire states to embrace alternative forms of assessment. But alternative 
formats alone cannot guarantee good assessment. We need to apply to 
alternative assessments the same scrutiny that allowed us to see the 
limitations, as well as the strengths, of more traditional tests. We need 
to be sure that the assessments we plan to use are helping and not hurting 
students, programs, and schools. 

This chapter highlights issues that should be considered when using 
assessments, alternative or otherwise. We begin with an overview of two 
key concepts in assessing the quality of any assessment: validity and 
reliability. We then examine three major questions guiding appropriate 
use of assessment information: 

1. How does your decision context and intended use influence your 
concerns for the quality of your assessment program? 

2. How do you ensure that an assessment is giving you good information
for decision making?

3. How can you use your assessment results to improve instruction? 

Note that we address issues of assessment quality before we provide 
concrete examples of how to use assessment results. We do this to 
emphasize that assessment quality is always an issue and should be 
considered before actually using results. If an assessment does not 
provide good information for decision making, its use may constitute 
misuse. 

Before venturing further, we remind you that for purposes of simplic- 
ity throughout this book we have examined issues from the perspective 
of a single assessment. No doubt you are well aware that no single 
assessment or test constitutes a sound assessment strategy. All assessments,
even the very best, are imperfect and fallible. Alternative assessments,
like all assessments, should be used in concert with other sources
of information to constitute a systematic and balanced assessment pro- 
gram. As you read about factors influencing test use, keep in mind that 
the same concerns that apply to an individual assessment are applicable 
to a collection of assessments or an entire system of assessment. 

Issues in Ensuring Quality — 
Validity and Reliability 

Does an assessment provide accurate information for decision making?
Do its results permit accurate and fair conclusions about student per- 
formance? Does using the results contribute to sound decisions? These 
are the central issues in judging the quality of an assessment. If we wish
to answer these questions in the affirmative, our assessments must be 
both reliable and valid— terms the measurement community uses to 
address these same concerns. 



Reliability: Stability of Performance 

Earlier, we introduced the concept of reliability as it relates to the 
consistency of human judgments. We have seen that there are several 
ways to ensure acceptable levels of rater agreement about student per- 
formance. However, reliability in the larger sense refers to whether test 
scores retain their meaning (remain consistent) despite superficial 
changes in the assessment situation— from one day to the next, regardless 
of the person judging the performance or the day or time at which 
assessments are scored. If Maria writes a critique of Tristram Shandy
today, tomorrow, or next Tuesday, we expect her performance to be
essentially the same on all three occasions. If her teacher reads her paper
tonight, tomorrow, or next Tuesday, we expect the teacher to give her the
same grade or to draw the same conclusions about how her skills have
developed and about her strengths and weaknesses. If Byron is able to
create two approaches for answering a mathematics problem today, we
expect him to be able to come up with a similar analysis of a similar 
problem on Friday or next week. Without such consistency, we cannot 
say with any confidence that we know what a student can do. An 
unreliable test score is useless because it does not tell us anything 
meaningful or generalizable about student performance. For this reason,
we must ensure that our results are reliable before we concern ourselves
with validity, the issue closest to test use. In fact, most of us at one time
learned the maxim, "to be valid, a test score must be reliable." When
asked to recall this truism, many of us aren't sure whether reliability 
precedes validity or vice versa. Perhaps the easiest way to keep the 
relationship straight is to remember that for a score to be valuable 
(validity) for decision making, it must be repeatable (reliable). 



Validity: Accuracy of Test-based Conclusions 

Measurement specialists know that although reliability is necessary, it 
is not a sufficient condition for validity — in other words, whether a test 
score yields accurate conclusions about a student's performance and is 
subsequently a sound basis for decisions. A test result could be perfectly
reliable but not very relevant to the decision for which it is intended. To 
take an extreme example, a test for typing or word processing may give 
you highly reliable (repeatable and consistent) information for judging 
a student's keyboarding skills and speed, but these results are useless for 
making decisions about the student's writing ability. Similarly, a multi- 
plication test may give you highly consistent results about your students' 
computation skills, but be of limited use in determining whether they 
are successful problem solvers. 

Determining the validity of an assessment depends on how you plan 
to use it. Throughout this book, we have used the word "validity"
somewhat loosely, as though it were a quality or characteristic of a 
particular test. In fact, assessments themselves are neither valid nor 
invalid; their validity depends on the purposes for which they are used. 
We assess the validity of a test by determining whether or not a conclu- 
sion based on the test score is accurate for a particular use or purpose. 
For example, if we wish to use the results of a test to identify students 
who have mastered linear equations we ask, "Do the scores identify all
the students who have mastered linear equations?" or "Do the students
identified as those who need more help actually have such a need?" More 
precisely, when we speak of the validity of the test to identify masters of 
linear equations, we are really referring to the evidence we have that tells 
us our score-based conclusions are correct, that students who score at or 
above our passing score have actually mastered the content. We have 
little reason to use results and can have little confidence in doing so until 
we have corroborating evidence, such as student performance on subsequent
assignments, performance on similar kinds of assessments,
teacher observation, and other teacher judgments that support our
score-based conclusions.
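
The two questions about mastery of linear equations can be turned into simple tallies once you have corroborating evidence to compare against. The sketch below is one hedged way to do so; the function name, the data, and the use of a single True/False judgment per student as the corroborating evidence are all assumptions made for illustration.

```python
def classification_evidence(scores, mastered, cut_score):
    """Compare pass/fail decisions at `cut_score` with corroborating evidence of mastery.

    `scores` and `mastered` are parallel lists; `mastered[i]` is True when other
    evidence (for example, teacher judgment) indicates the student has mastered
    the content.
    """
    if len(scores) != len(mastered):
        raise ValueError("scores and mastered must be the same length")
    passed = [s >= cut_score for s in scores]
    masters_total = sum(mastered)
    flagged_total = sum(not p for p in passed)
    masters_identified = sum(p and m for p, m in zip(passed, mastered))
    flagged_needing_help = sum((not p) and (not m) for p, m in zip(passed, mastered))
    return {
        # Do the scores identify all the students who have mastered the content?
        "masters_identified": masters_identified / masters_total if masters_total else None,
        # Do the students flagged as needing help actually have such a need?
        "flagged_who_need_help": flagged_needing_help / flagged_total if flagged_total else None,
    }


# Hypothetical class of eight students and a passing score of 6.
scores = [9, 8, 7, 6, 5, 4, 3, 2]
mastered = [True, True, True, False, True, False, False, False]
print(classification_evidence(scores, mastered, cut_score=6))
```

Proportions well below 1.0 on either tally suggest that conclusions based on the passing score need further support before they are used for decisions.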

Because it is somewhat unwieldy to repeat this precise definition, we 
shall continue to use "validity" to stand for "evidence to support score- 
based inferences." As you are reading, keep in mind the more accurate 
definition. 

Remember, too, that assessments can be valid for some purposes but 
inappropriate for others. For example, a survey test of basic skills 
provides useful comparisons with a national sample but may be rela- 
tively worthless for pinpointing mastery of local curricular objectives. 
The results of a final exam may be valid for determining whether a 
student should receive an "A" or a "B" in a class, yet it may not be valid 
for identifying the students who would benefit most from accelerated
instruction or the select few who can participate in the new gifted
program. The lesson here is that if a test claims to have multiple uses, it 
should be accompanied by evidence to support each separate use. What 
kind of evidence is that? The next section provides you with things to 
think about when determining what kinds of formal evidence you will
want to consider when using assessments to make decisions about 
students, classrooms, or schools. 

How Does Decision Context and Intended 
Use Affect Concerns for Quality? 

Know Your Assessment Purpose 

Assessments are created to provide information for making decisions 
about students, classrooms, schools, districts, states, and national edu- 
cation goals. What is the purpose of your assessment? What audiences 
will use the results? What other information will these audiences use to 
reach conclusions or to make decisions? The answers to these questions
have serious implications for what content should be included in an
assessment, how it should be constructed, and how much attention 
should be given to ensuring its quality. 

Consequences Make a Difference 

It's clear that some decisions about students and schools carry more 
serious consequences than others. High-stakes tests carry serious consequences.
Low-stakes assessments have less serious impact on individuals.
These include assessments used to monitor progress, plan instruction,
even grade courses (if a variety of scores and other evidence will
be used to constitute the grade). The higher the stakes associated with 
an assessment, the greater the need to document its quality — its validity 
and reliability. 

Gather Corroborating Evidence for Decision Making 

Even in low-stakes situations, errors can compound and cause great 
harm. The accumulation of your unit tests and other classroom assessments
sends important messages to students and parents and can have
significant impact on them. Likewise, informal judgments of school 
quality based on assessment results can affect faculty morale and prac- 
tices over time. Thus, validity warrants your attention regardless of 
whether your assessment context is high or low stakes.

Identifying early on whether your assessment is high or low stakes 
will help you determine how much evidence you need to document the 
quality of your assessment. What are the consequences of test performance?
Will assessment results be used with lots of other corroborating
information to make decisions about students? Will it be nearly the sole 
basis for a decision? If a score-based decision is incorrect, is it easily 
fixed? Could you be sued? If an assessment carries serious consequences, 
as do nearly all those used for accountability, placement, or funding 
purposes, formal evidence of validity for intended purposes is essential. 

Evidence of Validity: How Do You Know an 
Assessment is Giving You Good Information? 

Concerns for assessment validity are threaded throughout this book, so 
some of the issues highlighted here will sound familiar. As should be 
clear, the quality or validity of an assessment for a particular purpose 
depends on several issues and requires consideration of a variety of 
evidence. Those interested in greater technical detail and in techniques 
for gathering corroborating evidence may be interested in Standards for 
Educational and Psychological Testing (1985). These standards serve as
the touchstone for test quality whenever an assessment is called into 
question during a lawsuit. Adhering to them provides you with some 
assurance that any assessment you might use that could result in litiga- 
tion will be defensible. 

Can the Scores Be Used To Describe What Students Have
Learned? 

One of the primary uses of assessment is to find out what students know
or have learned with regard to particular instructional goals. Validity for
such a purpose requires a good match between those goals and the 
content of the assessment. The following questions will help you deter- 
mine whether there is such a match: 

■ Is the test accompanied by a clear definition of assessment goals
so that you can judge the match between skills and knowledge 
intended for assessment and those emphasized in your class or 
school? 

■ Does the content of the assessment reflect the most important and 
full range of content in your curriculum? Is there a good match 
between the task description and your instructional priorities?

■ Do the assessment tasks require the kinds of knowledge, thinking,
problem-solving, and process skills that are addressed by your
instruction? 

■ Does the assessment tap complex thinking skills? Which ones?

■ Does the assessment include scoring criteria? If so, do the criteria
match instructional goals, current learning theories, and curricu- 
lum priorities? 

■ Do the criteria include standards for judging the adequacy of 
student performance? If so, how were these standards determined?

■ Is the task developmentally appropriate? Does it reflect processes
and outcomes suitable to the intended students? 

■ Have students had sufficient opportunity to learn what's included 
in the assessment? 

When you answer questions like these in the affirmative, you have some 
evidence that your assessment results will lead to accurate conclusions 
about how well students have achieved instructional goals and about
how effective your instruction has been. 

If you want additional evidence of the validity of your test on these 
dimensions, you might ask a colleague to review your assessment and 
either pose the same set of questions or a less directive set such as the 
following: 

1. What do you think this assessment measures?

2. What will this assessment tell me about my students in terms of
my goals? Our school's performance standards? Important student
outcomes? Student strengths and weaknesses?

3 Is this typo of assessment what yon would have visualized lo 
assess my goals? 

4, What might a typical response to this assessment look like? 

Still more formal evidence results if you convene a panel of subject 
matter experts and ask them to rate your assessment on the same 
questions of curriculum match. For a high-stakes test, such as your state 
assessment, you should look for such evidence. 

In reviewing your assessment for validity in these areas, be aware of 
the limits of "face" validity. While the task on the surface may appear to 
assess desired outcomes, until you see the actual student responses, you 
cannot be completely clear about what you are measuring. What knowledge 
and skills do students actually use to respond to your assessment? 
The only way you know whether the assessment really assesses your 
intended goals is to gather evidence corroborating the test score 
interpretation. We could collect this evidence through observation, careful 
review of student performance, or debriefing students about what skills 
and knowledge they used to address the assessment task. For example, 
if your assessment is designed to judge a student's ability to make 
connections between Hamlet's personality and other historical figures, 
you cannot be completely sure that the responses represent critical 
thinking and extension of concepts to new contexts. To know that your 
assessment is yielding valid results, you need to reassure yourself that 
students have not rehearsed and memorized answers, used some pub- 
lished analysis of Hamlet, or answered this question previously. 

Once you determine that your assessment reflects intended goals, you 
can entertain the important issue of how well the particular test score 
typifies a student's achievement. 




Student? 

An important issue in the validity of performance assessments for any 
purpose relates to whether you can generalize from a student's perform- 
ance on one task to the next. After all, we teach for transfer. We want our 
students to possess enduring knowledge and skills. Therefore, we hope 
and often assume that student performance on our assessment tasks 
generalizes to a larger domain and that the results of an assessment 
represent how students will perform on a larger set of tasks. After all, 
when we give students a hands-on science assessment involving silk 
worms, we probably don't care as much whether students are able to do 
the specific silk worm experiment as we do about their skills in using 
the scientific method. 

This issue of transfer and generalizability appears to be a problem 
area in alternative assessment, where available time constrains the 
number of tasks that students can complete. What tasks, skills, content, 
and performances need to be included in an assessment to ensure that 
it generalizes to the larger domain of interest? How many samples of 
student performance do we need before we can make these generalizations? 
We don't know precisely, but the answer, unfortunately, is substantially 
more than one. 

For example, Herman (1991) reviewed the research on writing assessment 
and found that writing skill doesn't generalize across genres. More 
specifically, students who write good persuasive essays don't necessarily 
write good stories or literary critiques. Further, even within a genre, 
students' performance may vary substantially depending on the topic or 
prompt. These findings suggest that despite the intuitive validity of 
performance tasks and the extent to which they meaningfully engage 
students, alternative assessments may not necessarily lead to more valid 
inferences about larger performance domains. In other words, there 
appears to be a trade-off between depth and breadth of information 
provided by such assessments. 

How do we know whether the results from a student's assessment 
represent some larger, meaningful domain of performance? We gather 
evidence of generalizability by looking at the consistency of student 
performance across tasks that are intended to assess the same knowledge, 
skills, and dispositions. Technically, we can perform special statistical 
analyses that quantify the relationship between performance on one task 
and another, then use the decision rules for particular statistical tests to 
decide if we should have confidence in the results. While the appropriate 
analyses to use are well beyond the scope of this book, be aware that in 
high-stakes settings involving mandated tests, you want statistical 
evidence. Formal evidence should be presented to answer the question: 
Based on this one task, how accurate is my decision about a student? Or, 
even more useful is the question, how many tasks similar to this one 
must a student perform in order for me to make a decision with any 
assurance of accuracy? 
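
As a rough illustration only, the two questions above can be sketched with a simple correlation between two tasks and the Spearman-Brown formula, a standard psychometric shortcut that this book does not itself prescribe. The sketch assumes the tasks are interchangeable measures of the same skill; the score lists and function names are hypothetical, and a formal generalizability study would be needed in any high-stakes setting.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def tasks_needed(single_task_r, target):
    """Spearman-Brown estimate of how many comparable tasks are needed
    for the combined score to reach the target consistency.
    Assumes 0 < single_task_r < 1 and 0 < target < 1."""
    k = target * (1 - single_task_r) / (single_task_r * (1 - target))
    # Round first so floating-point noise does not push 6.0 up to 7.
    return math.ceil(round(k, 9))

# Hypothetical scores for the same six students on two tasks
# that are intended to assess the same skill.
task_a = [3, 4, 2, 5, 3, 4]
task_b = [2, 4, 3, 4, 3, 5]

r = pearson_r(task_a, task_b)
print(f"Consistency between the two tasks: {r:.2f}")
print("Tasks needed for a consistency of .80:", tasks_needed(r, 0.80))
```

The lower the consistency between single tasks, the more tasks the estimate calls for, which is one way to see why the answer is "substantially more than one."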

Recognizing that it is impractical to do complex statistical analyses 
on most classroom assessments, we can still improve the validity of our 
inferences about students by using as many observations or work samples 
as possible before making general statements or drawing conclusions 
about a student's performance capability. 

Can the Scores Be Used to Diagnose Students' 
Strengths and Weaknesses? 

Can the scores be used to diagnose the strengths and weaknesses of the 
curriculum? Another validity issue central to classroom and school uses 
of assessments is their diagnostic utility. Do the results tell you anything 
meaningful about why students performed as they did? 

If you wish to use scores to diagnose student strengths and weaknesses, 
the tasks and scoring criteria must be built on some credible 
learning theory of skill or knowledge acquisition. Let's look at what 
happens when a supposedly "diagnostic" score doesn't refer back to 
supported theory. In recent times, if a student's writing was judged to be 
inadequate, teachers would focus on teaching prerequisite skills such as 
grammar, mechanics, and paragraph structure. Research on the writing 
process discredits this discrete skills approach along with the diagnostic 
value of counting grammatical and mechanical errors as indicators of 
writing quality (Braddock et al. 1963, Elley et al. 1976). We can cite an 
analogous example in the area of mathematics. While the automaticity 
of calculation helps students do well in mathematics, it may be that 
mastery of fractions, decimals, and long division does not enhance 
student performance in algebra. In short, the pre-algebra diagnostic tests 
inflicted on most 8th grade students in this country are based on faulty 
theories of algebra readiness. These examples illustrate the formidable 
challenge in creating diagnostic assessments as well as the caution we 
must exercise when looking for diagnostic information from our own 
assessments. 

In previous chapters we stressed the need to link task descriptions 
and criteria to current theories of curriculum and learning. This theo- 
retical grounding also provides a link between desired outcomes and 
necessary prerequisites. Diagnostically valid assessment provides evi- 
dence of a body of research that supports the link between particular 
diagnostic scores and underlying theory. 



Is the Score Unbiased? 

Another critically important validity concern in classroom and school 
assessment is one of fairness and bias. Recent cognitive learning theory 
underscores the importance of background knowledge in solving prob- 
lems. It's clear that students from different socioeconomic, cultural, and 
linguistic backgrounds may possess different kinds of prior knowledge 
and experience. Do students have sufficient background knowledge to 
engage successfully in the assessment task? Does the content or context 
of the assessment give unfair advantage or disadvantage to children from 
different cultural or language groups? Is it equally meaningful and 
motivating for students of diverse backgrounds? Does the assessment 
contain culturally insensitive material or stereotyping? Answers to ques- 
tions such as these provide one line of evidence about the bias or fairness 
of assessments. 

Problems of differences in background knowledge can be minimized 
if we are sure that all students have ample opportunity in school to 
acquire required knowledge and skills. Teachers must ensure that what 
is being measured has been taught and that students have had the 
opportunity to learn relevant content and apply desired processes. Many 
authorities believe that evidence of opportunity to learn ought to be 
collected routinely in high-stakes testing situations. We want to be sure 
that all students have at least had an equal opportunity to learn. 

A variety of statistical analyses can be conducted to examine potential 
bias. Such analyses essentially look for differential performance among 
subgroups, controlling for various factors. While few teachers or school- 
based practitioners will be called on to conduct such analyses, they 
should be aware that such analyses exist and should be available for 
high-stakes, mandated tests. 



Is There Corroborating Evidence that the Assessment Serves 
its Intended Purposes? 

As should be clear at this point, demonstrating that an assessment is 
valid for a purpose requires gathering specific data to show the relation- 
ship between the results of the assessment and that purpose. For high- 
stakes, mandated tests, this means there should be specific studies 
investigating the meaning of the test scores (Shepard 1991). For example, 
if the results of a statewide mathematics portfolio assessment are used 
to identify school level strengths and weaknesses, then the state testing 
program needs to gather evidence that the scores can be used in this way. 
Or, if we claim that the senior portfolio, exhibition, or thesis demon- 
strates a student's critical thinking and expressive abilities as well as 
mastery of certain content, we need independent, corroborating evi- 
dence of this score interpretation. Similarly, if we use the results of an 
assessment to determine who gets into algebra, we need independent 
evidence of the relationship between the content of the test, algebra 
readiness, and subsequent course performance. 

Does the Assessment Have Positive Consequences 
for Learning and Instruction? 

The current controversy about traditional standardized tests should 
teach us an important lesson: We need to be vigilant about the conse- 
quences of an assessment. Good intentions do not ensure beneficial 
results. Test-based accountability was intended to help improve schools 
and their effectiveness with students. To many, over-reliance on multi- 
ple-choice testing has hurt the educational process and detracted from 
meaningful teaching and learning. 

We want to make sure that our new assessments help rather than hurt 
schools and the people within them. For high-stakes, mandated testing 
programs, this means continuous attention to the actual effects of pro- 
grams and formal studies to evaluate the effects on curriculum, teaching, 
and student learning, among other intended and unintended conse- 
quences. For a teacher in the classroom, it means attention to the effects 
of assessment, for example: 

■ What values are implied by the assessment? Does it encourage 
thoughtfulness and accuracy rather than impulsivity? Multiple 
solutions versus one right answer? Does it honor diversity? 

■ Is the time students and teachers spend preparing for this assess- 
ment well spent? 

■ Are the outcomes worthwhile? Are students held to a high stand- 
ard? Does the task call for complex, rich, challenging use of 
students' minds? 

■ Are the tasks authentic and meaningful for students? Can stu- 
dents see connections to their own lives? 

Reprise: Ensuring Reliability and Validity 

To repeat, we want to have confidence in the quality of an assessment 
prior to using it. Figure 7.1 summarizes some of the strategies discussed 
in this and previous chapters that contribute to such confidence. 

How Can You Use Assessment Results 
to Improve Instruction? 

Figure 7.1 

Building Reliability and Validity into Alternative Assessments 

Test Development Stage: Strategies for Ensuring Valid Score Inferences 

Identifying testing goals 

• Link goals to important curricular objectives related 
to transferable or fundamental content, skills, 
processes 

• Create clear, unambiguous goal statements 

Creating task descriptions 

• Create fully developed task description 

• Review task description against goals 

Selecting/developing criteria 

• Review criteria against goals and underlying learning, 
instructional, and/or curriculum theory 

• Ensure criteria reflect teachable goals 

• Ensure criteria don't favor a particular gender, 
ethnicity, language background 

Scoring performances/products/processes 

• Classroom use: score systematically and recheck 
work periodically 

• Score like topics or like dimensions at same time 

• Large-scale use: train raters, monitor consistency 

• Document the several kinds of reliability (intrarater, 
interrater, across topic, occasion, for students over 
time) 

• Ensure minimal levels of reliability (each kind that's 
appropriate) and a reliability coefficient of at least .70 
for most assessments, .90 for high-stakes tests 

Using alternative assessments 

• Limit inferences from scores to the use for which the 
assessment was developed or for which you find 
multiple sources of evidence that the score can be 
used in a particular way 

• Find evidence to support score-based inferences in 
the test manual, research studies, from colleagues 

• Check inferences from test scores against other kinds 
of information, your prior experience, other scores, 
other work student does, observations 

• Never make an important decision based on only 
one score 

Although we've traveled an arduous path to get here, we have finally 
arrived with high-quality assessments, appropriate to our intended uses. 
How will we use them? Most often, we'll use assessment results to 
answer two basic questions: 

■ How are we doing? 

■ How can we do better? 

We seek to answer these questions at any number of levels, from answers 
about individual students to those about the school, the school district, 
the state, or even the nation. For example, at the individual level: How 
is Kang doing in mathematics? And depending on our answer, how can 
we help him improve? How is Clarissa doing in science? And what does 
that mean about which course assignments will be of most benefit to her 
next year? Or at the class level: How did my students do in oral 
expression? What does that tell me about the strengths and weaknesses 
of my instruction in that area? Does a particular group or the class as a 
whole need remediation? Or at the school level: How did the 5th grade 
do in various types of writing? What do the findings suggest for the 
strengths and weaknesses of our curriculum and instructional materials? 

In the following sections we discuss basic approaches to answering 
each of these familiar questions. 



How Are We Doing? 



Setting Standards 

Implicit in "how are we doing" questions are concerns for quality and 
standards. We want to know not only how students are doing, but more 
important, are students accomplishing intended goals? Are they performing 
well? Are they performing as well as we expected? In a nutshell, 
"are we doing well, or at least okay?" 

How do we determine the answer to such questions? Ideally, in 
formulating your scoring criteria, you also considered standards for 
performance. For example, you decided that a "5" meant excellent and 
a "3" meant minimally passing. If this is the case, you can answer the 
"how are we doing" question by referring to the standards in your scoring 
criteria. If your criteria are descriptive and do not include performance 
levels, this is the time to equate specific score points to performance 
standards. There are two basic types of standards or comparisons: 
absolute and relative. Absolute standards hold sway when we decide 
how well students are doing by referring to some criterion of adequate 
performance. Sometimes this criterion is set formally by a school or 
district; sometimes it's a discipline-based standard. Mathematicians 
agree on what should be included in mathematics solutions. English 
teachers concur on standards for clearly written summaries. Social 
studies teachers know what evidence is acceptable in supporting a 
political position. We refer to these standards when answering such 
questions as: Was Leticia able to write an effective research report? Was 
Judd able to estimate the costs of establishing a new restaurant? 

We can also use relative standards to judge how well students are 
doing. Relative standards are simply those that compare your students' 
performances to other groups of students. Comparing students to the 
national norm (for example, the score at the 50th percentile achieved by 
a national sample of students) is a common example of a relative 
standard. Experienced teachers commonly compare their students to 
other groups with whom they're familiar when judging student perform- 
ance. They may have a pretty good idea of grade-level performance and 
typical student behavior based on past classes, or on comparisons with 
colleagues' classes, or even with the results of state and national assess- 
ment data. Relative standards help us answer such questions as: Did the 
new materials seem to help this year's students do better than last year's? 
Are John's literacy skills developing at an acceptable rate compared to 
developmental norms? Are students in the interdisciplinary curriculum 
doing as well or better than those in the regular curriculum? If it's a grade 
we are assigning, we often use relative standards by comparing current 
performance to other students' past performance levels. 

While sometimes useful, relative standards have serious limitations. 
Their value is limited by the similarity of the groups being compared. 
For example, it would be unfair and inappropriate to compare the 
performance of special education students on a standardized test to that 
of a typical national norm group from which most special education 
students have been omitted. Likewise, using the ranking of countries in 
international test comparisons to draw conclusions about the quality of a 
nation's educational system is misleading when various kinds and 
proportions of students take the test in different countries. The average 
test scores in one international assessment came from 75 percent of the 
17-year-olds in the United States, but from only the top 9 percent of 
17-year-olds in West Germany, and the top 45 percent in Sweden. 

A word should be said here, too, on another kind of relative stand- 
ard — the practice of "grading on a curve," in which teachers decide at 
the beginning of a class that the top portion of students will get A's, the 
middle will get B's, and the bottom will get C's or D's with no further 
definition of what level of performance is expected for each grade. This 
kind of relative standard merely ranks students. The problem is that 
although Kenny and Leila score higher than anyone else and receive A's, 
they may not have learned enough of the content or may not be able to 
perform well enough to be worthy of an A according to an absolute 
standard of performance quality. Likewise, if the teacher and materials 
are good enough, the entire class may be able to do very good work 
deserving of an A. The point is that while relative standards have their 
place, the value of absolute standards is often overlooked. Telling stu- 
dents you are grading on a curve suggests to them that it's enough to be 
better than someone else, that whoever is in the lower third is second 
rate, regardless of their efforts and yours, and that absolute standards of 
what constitutes acceptable or excellent work are not important. 
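
To make the contrast concrete, here is a small sketch that grades the same hypothetical class two ways. The student scores, the split into thirds, and the percentage cutoffs are all invented for illustration; only the names Kenny and Leila come from the example above.

```python
def grade_on_curve(scores):
    """Relative standard: top third get A, middle third B, bottom third C."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    grades = {}
    for position, name in enumerate(ranked):
        if position < n / 3:
            grades[name] = "A"
        elif position < 2 * n / 3:
            grades[name] = "B"
        else:
            grades[name] = "C"
    return grades

def grade_absolute(scores, cutoffs):
    """Absolute standard: cutoffs is a list of (minimum score, grade),
    ordered from the highest minimum to the lowest."""
    grades = {}
    for name, score in scores.items():
        grades[name] = next(
            (grade for minimum, grade in cutoffs if score >= minimum), "F"
        )
    return grades

# Hypothetical percent-correct scores for a class that learned little.
scores = {"Kenny": 62, "Leila": 60, "Sam": 55, "Ana": 50, "Raj": 45, "Mia": 40}
# Hypothetical absolute cutoffs set before instruction began.
cutoffs = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]

curve = grade_on_curve(scores)
absolute = grade_absolute(scores, cutoffs)
for name in scores:
    print(f"{name}: curve {curve[name]}, absolute {absolute[name]}")
```

Kenny and Leila earn A's on the curve simply because they outscored their classmates, while the absolute cutoffs show that neither has reached even a C level of performance.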

Applying standards is part of the unconscious process people use to 
make judgments. Both absolute and relative standards present useful 
approaches for determining how well students are performing. In fact, 
absolute standards often incorporate relative information. How do we 
know that students have to get 80 percent correct on the laboratory 
procedures test to be successful in chemistry? Because from our experi- 
ence, we have found that most successful students have scored at least 
80 percent on the science laboratory procedures test. In most instances, 
you will answer the "how are we doing" question by referring both to 
absolute standards and appropriate reference groups. 

Using Test Results To Make Decisions 

Once you establish whether you wish to compare student performance 
to absolute standards or relative standards, you can select from several 
techniques for summarizing your assessment results. As you use these 
summary procedures, keep in mind that there is much about student 
performance the score does not reveal. Any summarization process 
creates a trade-off between economy and rich description. We believe 
that the descriptive information provided by alternative assessments is 
one of their most compelling attributes. However, there will be occasions 
when you will need to communicate results as numerical summaries. 
There are three basic ways of presenting the numbers. You can present 
them as a distribution of scores; by giving the average, median, or mode; 
or by showing the percent of students reaching some absolute standard. 

How you summarize depends on the kinds of comparisons you want 
to make and whether your scoring criteria include only one dimension 
(scale) or several. 

Summarizing a Single Dimension 

Let's look first at the simple case, a holistic or single dimension scoring 
system. 

Distribution of Scores 

To look at the range of student performance on one dimension, simply 
calculate how many students received each possible score. You can even 
make a sketch of the score distribution, using either the raw number or 
percent of students attaining each score. A picture of class performance, 
such as Figure 7.2, shows us whether most students are scoring high, 
low, or somewhere in the middle. This may be particularly helpful when 
you have no preconceived notion of how students will perform. You can 
use such graphs to monitor how well you succeed with students from 
one year to the next. Researchers call the initial measurement "baseline 
information." 
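
If your scores are already in a list, the tally and sketch described above take only a few lines. The following is a minimal sketch with made-up scores on a 5-point scale; the function names are ours, not part of any scoring package.

```python
from collections import Counter

def score_distribution(scores, possible_scores):
    """Count how many students received each possible score."""
    counts = Counter(scores)
    return {s: counts.get(s, 0) for s in possible_scores}

def sketch(distribution):
    """Draw a rough text bar chart, one '#' per student."""
    lines = []
    for score, count in distribution.items():
        lines.append(f"{score}: {'#' * count} ({count})")
    return "\n".join(lines)

# Hypothetical holistic scores for seven students on a 5-point scale.
class_scores = [3, 4, 4, 5, 2, 3, 4]
distribution = score_distribution(class_scores, range(1, 6))
print(sketch(distribution))
```

Saving this year's tally gives you the baseline against which next year's class can be compared.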

Figure 7.3 illustrates the distribution of student performance (on one 
scale) on two different history essay topics CRESST has used in research. 
Note that the graph shows us that more students scored higher (3.5 to 
5.0) on the immigration topic than on the Lincoln-Douglas topic. What 
might such a finding suggest about the relative strength of instruction in 
these two topic areas? 

Average score. Another way to look at how well students are doing 
is to calculate numerical summaries of class performance using mean 
(arithmetic average), median (half scoring above, half below), or mode 
(most frequently occurring score). These numerical summaries show us 
how the bulk of the students are doing. They provide a useful shorthand 
for communicating with others. 

If a colleague asks how your students are doing in oxidation-reduction 
equations, you can use summary statistics to answer: "On an 
8-point scale, they average 6.8." Your colleague can form a mental 
picture of where the majority of the students seem to cluster and compare 
that performance to her class, to last year's students, or to her understanding 
of what the criteria say a "6.8" student is capable of doing. 
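
The three summaries can be computed with Python's standard `statistics` module, as in this brief sketch. The eight scores are invented, and the 8-point scale simply echoes the example above.

```python
import statistics

def summarize(scores):
    """Return the mean, median, and mode of a list of scores."""
    return {
        "mean": statistics.mean(scores),      # arithmetic average
        "median": statistics.median(scores),  # half above, half below
        "mode": statistics.mode(scores),      # most frequent score
    }

# Hypothetical scores for eight students on an 8-point scale.
class_scores = [7, 8, 6, 7, 5, 8, 7, 6]
summary = summarize(class_scores)
print(summary)
```

When the three values differ noticeably, the distribution is probably lopsided, and a sketch of the full distribution will tell you more than any single number.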

Figure 7.2 

Percent-reaching standard. If you are using an absolute standard, you 
can decide which score point represents mastery or you might use a 
two-tiered standard of adequate and exemplary performance. For example, 
on a 5-point scale, 3 might be considered sufficient for mastery in 
the first system. In the two-tiered system, a score of 3 might represent 
adequate performance and a score of 4 or better might be required to 
reach the level of exemplary performance. Thus you might report that 
10 percent of your students achieved a score of 4 or more, reaching the 
standard of exemplary performance and that an additional 50 percent of 
students scored 3, reaching the standard of adequate performance. This 
might be presented in a pie chart to illustrate what proportion of students 
fell in each category (see Figure 7.4). As with average scores, percent- 
mastery data from one year or group can be compared to that from 
another year or group. 
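
The two-tiered report described above can be sketched as follows. The cut points (3 for adequate, 4 for exemplary) come from the example; the ten student scores are hypothetical and were chosen to reproduce the 10 percent and 50 percent figures.

```python
def percent_reaching(scores, adequate=3, exemplary=4):
    """Percent of students at the exemplary level, at the adequate
    level only, and below the adequate standard."""
    n = len(scores)
    exemplary_n = sum(1 for s in scores if s >= exemplary)
    adequate_n = sum(1 for s in scores if adequate <= s < exemplary)
    below_n = n - exemplary_n - adequate_n
    return {
        "exemplary": 100 * exemplary_n / n,
        "adequate": 100 * adequate_n / n,
        "below": 100 * below_n / n,
    }

# Hypothetical scores for ten students on a 5-point scale.
class_scores = [4, 3, 3, 3, 3, 3, 2, 2, 1, 2]
report = percent_reaching(class_scores)
print(report)
```

Running the same function on last year's scores gives a directly comparable set of percentages.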

Trends over time. Regardless of whether you look at distributions, 
averages, or percent of students reaching a standard of performance, you 
may want to keep track of trends in performance over time. You can ask 
yourself, "Did the same proportion of my class this year receive high 
scores compared to last year's class?" "Was this year's class average above 
or below that of last year's?" "What proportion of this year's seniors 
reached the 'exemplary performance' level compared to last year's?" For 
an individual student you might ask, "How does Justin's score on this 
persuasive essay compare with his September, November, and February 
persuasive essay scores?" These longitudinal comparisons help you put 
the performance of present students into perspective. 



Figure 7.4 

Percent of Students Reaching Performance Standards 

Summarizing Several Dimensions 

If you have several dimensions of performance to summarize, you have 
a couple of choices: (1) you can add the scores together or average 
them — both methods give the same overall picture of what happened, or 
(2) you can present separate graphs, averages, or percents for each 
dimension. 

If you add or average scores, you may want to weight some dimensions 
more heavily than others if they are more important to your 
instructional goals. For example, although you may rate student writing 
on grammatical conventions, style, and coherence, you may decide to 
give the coherence dimension extra weight — for example, multiply these 
scores by 1.5 or 2 — compared to grammar and style when presenting an 
overall summary of student work. 
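
As one way to picture the weighting, the sketch below doubles the coherence score before adding or averaging. The dimension names follow the writing example above; the student's scores, the weight of 2, and the function names are assumptions made for illustration.

```python
def weighted_total(scores, weights):
    """Add dimension scores, multiplying each by its weight.
    Dimensions without a listed weight count once."""
    return sum(score * weights.get(dim, 1.0) for dim, score in scores.items())

def weighted_average(scores, weights):
    """Weighted total divided by the sum of the weights used."""
    total_weight = sum(weights.get(dim, 1.0) for dim in scores)
    return weighted_total(scores, weights) / total_weight

# Hypothetical ratings for one student's essay on a 5-point scale.
essay = {"grammar": 4, "style": 3, "coherence": 4}
# Coherence counts double because it matters most to our goals.
weights = {"coherence": 2.0}

print("Weighted total:", weighted_total(essay, weights))
print("Weighted average:", weighted_average(essay, weights))
```

Dividing by the sum of the weights keeps the weighted average on the original 5-point scale, which makes it easier to compare with your scoring criteria.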

There are certain trade-offs involved in averaging or adding together 
multidimensional criteria. While you can form a general picture of 
student performance, you have to realize that average scores can hide 
widely different kinds of performance. For example, some students with 
an average score of 7 may have very good problem-representation skills 
but very poor problem-solving skills, whereas other students may score 
7 on all dimensions. If you need to see such distinctions in the score 
results to inform instructional decisions, you may want to present the 
results for each dimension or for certain key scales separately. 

We can also ask the "how are we doing" question with regard to each 
separate dimension. For example, in my math assessment task, how are 
my students doing in communication, in applying math concepts, or in 
using formulas? One useful strategy for dealing with multidimensional 
outcomes is to look at the proportion of subscales where student 
performance was adequate or above. In our three-subscale example, we 
could summarize our results by looking at what percentage of students 
received an adequate or higher rating for one dimension, for two, and 
for all three. Figure 7.5 provides an example of this strategy. 
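
A tally of this kind might be sketched as below. We assume a score of 3 or higher counts as adequate and that each student is counted once, under the exact number of dimensions on which he or she reached that level; the student labels and scores are hypothetical.

```python
from collections import Counter

def adequacy_summary(students, adequate=3):
    """Percent of students rated adequate or above on exactly
    0, 1, 2, ... of the scored dimensions."""
    n_dims = len(next(iter(students.values())))
    counts = Counter(
        sum(1 for score in dims.values() if score >= adequate)
        for dims in students.values()
    )
    total = len(students)
    return {k: 100 * counts.get(k, 0) / total for k in range(n_dims + 1)}

# Hypothetical math assessment scores on three 5-point subscales.
students = {
    "Student A": {"communication": 4, "concepts": 3, "formulas": 3},
    "Student B": {"communication": 3, "concepts": 2, "formulas": 4},
    "Student C": {"communication": 2, "concepts": 2, "formulas": 3},
    "Student D": {"communication": 3, "concepts": 3, "formulas": 3},
}

summary = adequacy_summary(students)
for k, pct in summary.items():
    print(f"Adequate on {k} dimension(s): {pct:.0f}% of students")
```

If you would rather report "at least two dimensions," add together the percentages for two and three.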



Samples of Student Work 

Regardless of how you choose to present your findings — whether on a 
single dimension, average of several dimensions, or as a collection of 
several distinct dimensions — and whether or not you present trends over 
time, samples of student work help illustrate your results and inform 
your decisions. Numbers alone don't tell us everything we need to know. 
We don't want to reduce everything to numbers and lose the richness of 
student responses. More important, we don't want to lose sight of the 
quality of student performance and what quality work means. 

Figure 7.5 

Summarizing Multidimensional Criteria 

In considering the "how are we doing" question, you could select 
performance samples that represent the best, the average, and the poorest 
levels of performance. These exemplars communicate clearly to other 
teachers and often to parents the range of performance and where 
particular students fit. If you keep a file of best papers or even of 
exemplars of poor, adequate, and fabulous, you can watch how the 
general level of performance for each particular group progresses. Does 
the excellent lab report of five years ago seem only average now? If yes, 
then we are doing our jobs well. Does the average group-constructed 
newspaper on the "Lives of the Romans" of previous years appear to be 
exceptional when compared with today's products? We can then conclude 
we have work to do. Actual performance samples can serve the 
same purpose as numerical summaries when making informal decisions 
for classroom purposes. 



How Can We Do Better? 

We believe the primary purpose of assessment is to provide feedback for 
improving individual student achievement, classroom instruction, and 
school programs. If after investigating individual or group results, we 
find we fall short of desired goals, we need to identify strategies for 
improvement. Diagnostic assessment identifies the kinds of changes 
needed if we hope to do better by looking at both the patterns and the 
process of performance. 

Understanding Student Process 

For alternative assessments to answer this "how can we improve" 
question, we must build into our tasks and criteria opportunities for 
observing and documenting student processes as well as outcomes. If we 
wish to understand how to help students make better group presenta- 
tions, we need results related to how the presentations were planned, 
how roles were assigned, and how students collaborated to accomplish 
the task. The key to diagnosis is understanding the causes or precursors 
of performance. While we can never be entirely sure of what instruction 
causes which results, we need to consider some educated guesses, better 
known as hypotheses, about how adequate or excellent performance is 
constructed. To do this, we need to know how a specific performance is 
produced. 

Often diagnostic information is gathered separately from the outcomes 
assessment. The quickest and richest source of process information 
is simply to watch students as they perform a task and, in 
appropriate circumstances, interrupt individuals from time to time to 
ask: What did you do to get to this point? Why did you do that? What 
might you do next? We can even ask students to record in journals their 
reflections about their work in progress, or perhaps circulate among 
students as they work and write quick notes for future reference. Other 
times we might hold debriefing or in-progress conferences with students, 
then summarize results in our anecdotal records. 

At the school level, student process can be monitored in a variety of 
ways: (1) formal classroom observations, (2) videotaping, (3) scripting, 
(4) peer reviews, (5) teacher-student conferences, or even (6) document 
analysis, a procedure for collecting and reviewing key classroom items: 
syllabi, assessments, sample lesson plans, selected student work 
samples, and student or teacher portfolios. 

We can analyze this process information by looking for patterns 
related to outcomes. Did successful students approach the task in 
significantly different ways from less successful students? What kinds of 
misconceptions did the poor performers hold and how might these be 
related to deep misunderstanding of what was taught? What kinds of 
errors did poor performers make? Where in the process of completing a 
task did students have difficulty? This ongoing feedback about how 
students are completing a task provides valuable information about how 
to help students improve. 
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Profiles of Performance 

If you are using results from formal assessments for diagnostic purposes, 
they must have two characteristics: (1) a profile, scale, or set of criteria 
that describes component and process aspects of performance and (2) 
valid reasons or a theoretical framework that supports the relationship 
between the task components or processes and outcomes. When you 
have task criteria based on theoretically sound principles, you can 
review student performance profiles to identify areas of relative 
strength and weakness for individuals, groups, the class as a whole, 
the school, and so forth. For example, Figure 7.6 illustrates the strengths 
and weaknesses of Mike's history essay on the Lincoln-Douglas debate 
by graphing his scores on six dimensions along with the theoretical 
performance of an expert in history, derived from previous research at 
CRESST (Baker et al. 1992). Figure 7.6 suggests that compared to the 
history expert, Mike incorporated little overall prior knowledge and few 
historical principles in his essay, relied too heavily on a recently read 
text, constructed a relatively poor argument, and revealed several 
misconceptions. 

When using assessment for diagnostic purposes, you want to keep in 
mind the relationship among the performance subscales and the overall 
quality of performance. Your role as a diagnostician resembles that of a 
behavioral scientist; you are generating testable assumptions about cause 
and effect. What is the difference in the profiles of high- versus 
low-performing students? Which dimensions of performance seem to be most 
crucial if we want students to improve? How are the different 
dimensions related? Which should be taught first? For example, if your 
consistently excellent debaters have profiles that are uniformly high in 
"reference to factual information," "use of real-life examples," and "use 
of humor," then you would want to look at the low-performing students' 
profiles and see on which of these dimensions they were weakest. If you 
find that the poor debaters use humor and refer to real life in their 
arguments but are weak in the use of supporting facts, then you could 
begin to improve their performances by working on this skill. 
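
For readers who keep rubric scores electronically, the debater comparison 
above can be sketched in a few lines of Python. This is only an 
illustration under stated assumptions: the student labels, the 1-to-5 
scale, and every score are hypothetical, and the helper names 
group_profile and largest_gap are our own inventions rather than part of 
any established tool. 

```python
from statistics import mean

# Dimension names come from the debate example in the text.
DIMENSIONS = [
    "reference to factual information",
    "use of real-life examples",
    "use of humor",
]

# Hypothetical rubric scores on an assumed 1-5 scale, one list per student,
# in the same order as DIMENSIONS.
high_performers = {"Student A": [5, 4, 5], "Student B": [4, 5, 4]}
low_performers = {"Student C": [2, 4, 4], "Student D": [1, 5, 4]}


def group_profile(students):
    """Return the mean score on each dimension for a group of students."""
    return [mean(scores) for scores in zip(*students.values())]


def largest_gap(high, low):
    """Return the dimension where the high group most exceeds the low group,
    along with the gap on every dimension."""
    gaps = {
        dim: h - l
        for dim, h, l in zip(DIMENSIONS, group_profile(high), group_profile(low))
    }
    return max(gaps, key=gaps.get), gaps


weakest_dimension, gaps = largest_gap(high_performers, low_performers)
print(weakest_dimension)
```

With these invented scores, the largest gap falls on "reference to 
factual information," which matches the situation described in the text. 
Such a gap is only a hypothesis about where to begin instruction, not 
proof of cause and effect. 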

At the school or district level, when we wish to strengthen instruc- 
tion, our focus is on group performance rather than individuals. When 
reviewing group results, look at subgroups as well as subscale perform- 
ance. Classroom and school level summaries often mask different kinds 
of prior knowledge and experiences of identifiable subgroups such as 
boys, girls, students new to the school, non-native English speakers, 
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Figure 7.6 

Expert and Student Score Profiles for History Essays 
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students enrolled in certain courses, and so on. For example, Figure 7.7 
illustrates profiles of history essay performance for boys and girls. 
Performance on six scales is shown, and it may be noted that girls scored 
higher than boys on all scales, although the difference is greater on some 
scales than on others. Assuming that we have ruled out differences due 
to rater bias, what might such subgroup differences mean for instruc- 
tional decision making? 

If you wish to reach all students, you will want to know if some 
subgroups of students have different profiles from others. For example, 
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Figure 7.7 

Profiles of History Essay Performance for Boys and Girls 




do boys and girls who score high in mathematics problem solving get 
this high score in the same way? Or, among the group of low-performing 
essay writers in your school, do students new to the school have different 
instructional needs from those enrolled for three or more years? Among 
the "barely failing" and "barely passing" scores, do we find similar or 
different performance profiles? Are both borderline groups similar in 
their ratings on grammar and language mechanics? Is there one perform- 
ance dimension that separates these borderline groups, such as "organi- 
zation," that can provide a focus for remedial instruction? The point here 
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is that when you look at group results, you don't always get good 
diagnostic guidelines. Not only do you need to know in which areas 
low-performing students need more instruction, you also need to know 
who these students are. 

Just as your class average may not reveal the fact that two or three 
students were unable to do the task at all, group summaries may give a 
false impression that all students are performing near the same level. 
Part of your diagnostic mission is to find out which students or groups 
are not adequately reflected in the summary and provide appropriate 
summaries of their performances. 
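
The masking effect of an average can be seen in a small Python sketch. 
The names and scores below are hypothetical, the 0-to-5 scale is an 
assumption, and the cutoff used to flag students who "were unable to do 
the task at all" (the floor parameter) is our own illustrative choice. 

```python
from statistics import mean

# Hypothetical scores on an assumed 0-5 scale; 0 means the task was not completed.
scores = {
    "Ana": 4, "Ben": 5, "Cy": 4, "Dee": 0, "Eli": 5,
    "Fay": 0, "Gus": 4, "Hal": 5, "Ida": 5, "Jo": 4,
}


def summarize(scores, floor=1):
    """Return the class average and the students scoring below `floor`,
    who are not adequately reflected in that average."""
    average = mean(scores.values())
    not_reflected = sorted(name for name, s in scores.items() if s < floor)
    return average, not_reflected


average, not_reflected = summarize(scores)
print(average, not_reflected)
```

In this made-up class the average is 3.6, which looks respectable, yet 
two students (Dee and Fay) could not do the task at all. Reporting the 
flagged students alongside the average keeps them from disappearing into 
the summary. 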



The Use of Assessment Systems: Portfolios as a Case in Point 

Given the limitations of using one assessment task or testing occasion to 
generalize about an individual student, classroom, or school, we suggest 
you use several tasks or occasions to gather information about a student 
prior to making high-stakes decisions. A longitudinal approach to as- 
sessment puts the results of any one assessment into perspective. At the 
same time, multiple measures of the same outcomes provide alternative 
views of performance that combine to create a more complete picture of 
student achievement. 

Many teachers have turned to portfolio assessment as a strategy for 
creating a classroom assessment system that includes multiple measures 
taken over time. Portfolios have the advantage of containing several 
samples of student work assembled in a purposeful manner. Well-con- 
ceived portfolios include pieces representing both work in progress and 
"showpiece" samples, student reflection about their work, and 
evaluation criteria. Arter and Spandel (1992) summarize the kinds of concerns 
teachers should keep in mind when using portfolios or other compre- 
hensive assessment systems: 

1. How representative is the work included in the portfolio of what 
students can really do? 

2. Do the portfolio pieces represent coached work? Independent 
work? Group work? Are they identified as to the amount of support 
students received? 

3. Do the evaluation criteria for each piece and the portfolio as a 
whole represent the most relevant or useful dimensions of student 
work? 

4. How well do portfolio pieces match important instructional tar- 
gets or authentic tasks? 
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5. Do tasks or some parts of them require extraneous abilities? 

6. Is there a method for ensuring that portfolios are reviewed consis- 
tently and criteria applied accurately? 

Test Use: The First and Last Step in Alternative Assessment 

Throughout this chapter we have discussed test use as though it were 
the end product of the test development cycle. But it's clear that unless 
test use is considered before the purchase or development of an assess- 
ment, it is virtually impossible to get the information you really need. 
Assessment, like instruction, requires the simultaneous consideration of 
many issues. 

In this book, we have raised the major conceptual, if not all the 
technical, issues in alternative assessment. Our list is long but certainly 
not exhaustive. The field of alternative assessment is evolving so rapidly 
that today's canons are tomorrow's caveats. 

Creating and using performance assessments effectively can be com- 
plicated. If this is your first introduction to it, try to absorb the big ideas 
first. Your assessments will probably improve, and over time the details 
will become more approachable as you become more comfortable with 
the concepts and language. Because it's an iterative process, you will 
revisit issues, each time with increased experience and understanding. 

We hope that this guide will help you cut your way through the 
thicket of ever-increasing alternative assessment information so that you 
can find a clear pathway to more instructionally sensitive, powerful, 
equitable, and useful assessment. 
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